[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul by Nucs · Pull Request #611 · SciSharp/NumSharp

Nucs · 2026-04-22T09:19:43Z

Complete changelog of the nditer branch — everything in this PR since #612 merged.

455 commits · 806 files · +234,348 / −19,179 (vs master, after #612)

TL;DR

NpyIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
NpyExpr DSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
out= / where= / dtype= ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
NumPy-parity benchmark: geomean 1.00× at 10M elements across ~409 ops (166 faster / 171 close / 36 slower) — measured by a new official BenchmarkDotNet-vs-NumPy suite committed with the report.
36 new np.* APIs — sort, pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
C/F/A/K order support wired through the whole API — Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot is 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
Sorting, casts & Complex finished — np.sort/np.argsort on a radix line-kernel (closes a Missing Function); a SIMD strided-cast campaign that killed the cast cliffs (15×8×15 astype matrix: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy); np.zeros via calloc/demand-zero (O(1), was ~1000× slower); the six Complex transcendentals (sinh…arctan); and bit-exact pairwise summation for sum/mean.
Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
Differential fuzzing infrastructure — 37,445 bit-exact NumPy-comparison cases across 24 corpus tiers, a seeded random fuzzer with shrinker, a CI FuzzMatrix gate, and a nightly soak workflow.
Legacy iterator stack deleted outright — MultiIterator, the Regen-generated cast templates, and NDIterator itself (interface + class + AsIterator extensions) are all gone; every code path now iterates through NpyIter / NpyFlatIterator / GetAtIndex.
Test suite: 9,990 passed / 0 failed on net8.0 + net10.0 (+2,600 new test methods), plus the 37,445-case fuzz corpus replayed by the FuzzMatrix gate.

1. NpyIter — full NumPy `nditer` port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.

Capability	Detail
Iteration orders	C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes	`MULTI_INDEX`, `C_INDEX`, `F_INDEX`, `RANGE` (parallel chunking), `GotoIndex` / `GotoMultiIndex` / `GotoIterIndex`
Buffering	Buffered casting with all 5 casting rules, windowed buffered iteration, `DELAY_BUFALLOC`, buffered-reduce double-loop (incl. `bufferSize < coreSize`)
Reductions	`op_axes` with `-1` reduction axes, `REDUCE_OK`, `IsFirstVisit`, `REUSE_REDUCE_LOOPS` slab accumulation
Overlap safety	`COPY_IF_OVERLAP` via a port of NumPy's `mem_overlap` solver (`NpyMemOverlap.cs`) — overlapping in/out operands no longer silently corrupt
Masking	`WRITEMASKED` + `ARRAYMASK` executed — the buffered window flush writes back only mask-nonzero elements; `VIRTUAL` operands (null op slots) construct with NumPy 2.x semantics
Operands / dims	Unlimited operands (NumPy caps at `NPY_MAXARGS=64`) and unlimited dimensions (NumPy caps at `NPY_MAXDIMS=64`) via dynamic allocation
APIs	`Copy`, `GetIterView`, `RemoveAxis`, `RemoveMultiIndex`, `ResetBasePointers`, `IterRange`, `DebugPrint`, fixed/axis stride queries, `GetValue<T>`/`SetValue<T>`, …
Casting parity	`NpyIterCasting.CanCast` matches NumPy's `safe`/`same_kind` lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.
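
The axis coalescing listed in the capability table can be illustrated with a small pure-Python sketch. This is not the NumSharp or NumPy source: `coalesce` is a hypothetical helper, strides are counted in elements rather than bytes, and axes are listed outermost first. The idea is that two adjacent axes merge when one step of the outer axis covers exactly the whole inner axis.

```python
def coalesce(shape, strides):
    """Merge adjacent axes that can be walked as a single axis.

    An outer axis and the inner axis after it merge when
    outer_stride == inner_dim * inner_stride.
    Size-1 axes carry no information and are dropped.
    """
    out_shape, out_strides = [], []
    for dim, stride in zip(shape, strides):
        if dim == 1:
            continue  # a size-1 axis never advances the pointer
        if out_shape and out_strides[-1] == dim * stride:
            # the previous (outer) axis steps exactly over this inner axis
            out_shape[-1] *= dim
            out_strides[-1] = stride
        else:
            out_shape.append(dim)
            out_strides.append(stride)
    if not out_shape:  # every axis had size 1: a single element
        return [1], [0]
    return out_shape, out_strides


print(coalesce((2, 3, 4), (12, 4, 1)))  # C-contiguous: collapses to one axis
print(coalesce((4, 3), (1, 4)))         # transposed view: nothing merges
```

A fully contiguous array collapses to a single 1-D loop, while a transposed view keeps both axes. The all-size-1 case returns stride 0, which mirrors the size-1 stride-0 invariant mentioned above.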

Execution at NumPy speed

NpyIter isn't just correct — it is now the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

Aspect (float32)	NumSharp	NumPy	Ratio (NumSharp ÷ NumPy)
contig sqrt 10M	2.98 ms	3.24 ms	0.92×
contig add 10M	3.91 ms	4.09 ms	0.96×
strided add 1M	319 µs	416 µs	0.77×
strided sqrt 1M	206 µs	374 µs	0.55×
strided sum 1M	109 µs	205 µs	0.53×
fused `a*b+c` 10M	4.77 ms	13.38 ms	0.36×
fused `(a-b)/(a+b)` 10M	4.12 ms	22.33 ms	0.18×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API

User-extensible kernel layer on top of NpyIter — the public answer to "how do I write my own ufunc":

Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
Tier 3C — ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as np.evaluate(expr[, operands][, out]):

Per-node NumPy result_type typing: every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly. (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Also covered: strong-strong NEP50 promotion (incl. int/float tier crossing), weak Python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR/multiply=AND).
Fused reductions — NpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/fusion/evaluate_bench.{cs,py}.
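
The per-node typing and fused-reduction points above can be checked with plain integer arithmetic. The sketch below is pure Python with hypothetical helper names. It assumes operands of 100,000, which reproduce the wrapped value quoted above; the source does not state which operands were used.

```python
def wrap_int32(value):
    """Reduce a Python int to two's-complement int32, as a fixed-width multiply would."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def mul_then_add(i4_a, i4_b, f8_c):
    """(i4 * i4) + f8: the product is an int32 node, the sum is a float64 node."""
    product = wrap_int32(i4_a * i4_b)  # wraps before any promotion
    return float(product) + f8_c       # promotion happens at the add


def fused_sum_of_products(a, b):
    """sum(a * b) in one pass; the product array is never built."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


print(mul_then_add(100_000, 100_000, 0.5))  # 1410065408.5
print(fused_sum_of_products([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Typing the whole tree as float64 up front would instead give 10000000000.5, which is why the dtype has to be resolved node by node.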

3. Legacy iterator stack retired

MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
The Regen template NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
NDIterator (interface + NDIterator<T> + AsIterator extensions) deleted entirely — [Obsolete] tombstones that threw at runtime after the migration and were referenced by nothing live. Production iteration runs through NpyIter/NpyIterRef (kernels), GetAtIndex (flat reads), and NpyFlatIterator (np.broadcast(...).iters).
~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support

Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
New: np.asfortranarray, np.ascontiguousarray.
np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (previously a copy, 3,000× slower).
~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of new order tests (Sections 41–51: reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).
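
As a rough illustration of the contiguity tracking and order resolution described in this section, the following Python sketch computes C/F contiguity from a shape and element strides, then maps an order mode to a concrete layout. It is a simplification under stated assumptions, not the Shape or OrderResolver code: the function names are hypothetical, and the `'K'` branch only distinguishes the two contiguous cases rather than preserving arbitrary stride orderings.

```python
def is_c_contiguous(shape, strides):
    """Strides in elements. Size-1 axes are ignored because their stride is never used."""
    if 0 in shape:
        return True
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1:
            if stride != expected:
                return False
            expected *= dim
    return True


def is_f_contiguous(shape, strides):
    # F order is C order with the axes reversed
    return is_c_contiguous(shape[::-1], strides[::-1])


def resolve_order(order, shape, strides):
    """Map 'C'/'F'/'A'/'K' to a concrete output layout for a fresh copy (simplified)."""
    if order in ("C", "F"):
        return order
    c = is_c_contiguous(shape, strides)
    f = is_f_contiguous(shape, strides)
    if order in ("A", "K"):
        # Simplification: a full 'K' resolver follows the input's stride ordering
        return "F" if (f and not c) else "C"
    raise ValueError(f"unknown order {order!r}")


print(is_c_contiguous((3, 4), (4, 1)), is_f_contiguous((3, 4), (4, 1)))  # True False
print(resolve_order("A", (3, 4), (1, 3)))                                # F
```

Skipping size-1 axes is what lets a (1, 4) view with an arbitrary leading stride count as both C- and F-contiguous.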

5. New & completed `np.*` APIs

New functions (36):

Area	APIs
Fused / ufunc	`np.evaluate` (fused expressions — see §2), `np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `np.positive`
Sorting	`np.sort` (+ `ndarray.sort`; `np.argsort` reimplemented) — radix line-kernel on NpyIter, stable, NaN-last, all axes / orders (`IterAllButAxis` drive mirroring NumPy's `_new_sortlike`)
Manipulation	`np.pad` (all 11 NumPy modes + callable), `np.tile`, `np.delete`, `np.insert`, `np.append`
Indexing/selection	`np.take`, `np.put`, `np.place`, `np.extract`, `np.compress`, `np.argwhere`, `np.flatnonzero`, `np.diagonal`, `np.trace`, `np.unravel_index`, `np.ravel_multi_index`, `np.indices`
Statistics	`np.median`, `np.percentile`, `np.quantile` (all 13 interpolation methods, tuple axis, `out=`, `keepdims`, QuickSelect engine), `np.average` (`weights`, `returned`, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), `np.ptp`, `np.nanmedian`, `np.nanpercentile`, `np.nanquantile`
Math	`np.diff`, `np.ediff1d`
Creation	`np.asfortranarray`, `np.ascontiguousarray`
Runtime	`np.multithreading(enabled, max_threads)` — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:

np.clip — min=/max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
np.unique — 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
np.searchsorted — side=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
np.copyto — casting=, where= masked copies at NumPy speed (was 7–72× slower).
np.asarray — copy=, like=, device=, dtype-as-string. np.concatenate — full parity + C/F fast paths. np.all/np.any — tuple-axis, out=, where=. np.expand_dims — tuple axis. np.repeat — axis= parameter. np.power — integer-power semantics, negative-exponent ValueError, crash fix.
np.broadcast — N-operand form (0..64, then unlimited — NumPy parity, was 2-operand only), live index cursor, lazy .iters, .numiter.
Engine completeness: bool/char max/min, Complex quantile, IsInf implemented (was a stub); the six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan implemented (hybrid BCL + C99 edge fix-ups, NumPy 2.4.2 parity — were NotSupportedException).
Full 15-dtype coverage pushed through the hot paths: the SByte/Half/Complex dtypes introduced in #612 ([new dtypes, NEP50] fully supported Half/Complex/SByte, np.* alias overhaul, NumPy 2.x type alias alignment) now work across every kernel family this PR touches (reductions, indexing, trace, casts, quantile, …).

out= / where= / dtype= ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
out aliasing an input is well-defined via COPY_IF_OVERLAP — add(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.
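
Two of the rules above, mask-false slots keeping prior contents and overlap-safe aliasing, are easy to demonstrate with Python lists. This is a behavioral sketch with hypothetical function names; it models the semantics only and says nothing about how the kernels are implemented.

```python
def add_masked(a, b, out, where):
    """Element-wise a + b written into out only where the mask is True."""
    for i, keep in enumerate(where):
        if keep:
            out[i] = a[i] + b[i]
    return out  # the caller's own list comes back (reference identity)


def add_shifted_naive(x):
    """x[1:] = x[:-1] + x[:-1] done in place, reading values already overwritten."""
    for i in range(len(x) - 1):
        x[i + 1] = x[i] + x[i]
    return x


def add_shifted_copy_if_overlap(x):
    """Same operation, but the overlapping input is snapshotted first."""
    src = x[:-1]  # list slicing copies, standing in for the temporary
    for i, v in enumerate(src):
        x[i + 1] = v + v
    return x


out = [9, 9, 9]
add_masked([1, 2, 3], [10, 20, 30], out, [True, False, True])
print(out)                                        # [11, 9, 33]
print(add_shifted_naive([1, 1, 1, 1]))            # [1, 2, 4, 8]
print(add_shifted_copy_if_overlap([1, 1, 1, 1]))  # [1, 2, 2, 2]
```

The naive loop compounds its own writes, while the snapshot version reads only original values, which is the outcome the aliasing rule describes.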

6. Linear algebra

Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
np.multithreading — opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.
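
The reason packing makes transposed inputs free can be sketched in a few lines of Python. This is not the GEBP kernel: there is no blocking, no SIMD, and the view tuple layout `(buf, offset, rows, cols, row_stride, col_stride)` is a hypothetical convention chosen for the example. It only shows that once the packer reads through the strides, the multiply loop sees contiguous data regardless of the input layout.

```python
def pack(buf, offset, rows, cols, row_stride, col_stride):
    """Copy a strided 2-D view out of a flat buffer into contiguous rows."""
    return [
        [buf[offset + r * row_stride + c * col_stride] for c in range(cols)]
        for r in range(rows)
    ]


def matmul_strided(a, b):
    """a and b are (buf, offset, rows, cols, row_stride, col_stride) views."""
    a_cols = a[3]
    b_rows, b_cols = b[2], b[3]
    if a_cols != b_rows:
        raise ValueError("inner dimensions differ")
    packed_a = pack(*a)  # rows of A, contiguous
    # Pack B by columns (swap its dims and strides) so each dot product
    # reads two contiguous runs
    packed_bt = pack(b[0], b[1], b_cols, b_rows, b[5], b[4])
    return [
        [sum(x * y for x, y in zip(row, col)) for col in packed_bt]
        for row in packed_a
    ]


# x is 2x3 row-major; x.T is the same buffer with the two strides swapped
x_buf = [1, 2, 3,
         4, 5, 6]
x = (x_buf, 0, 2, 3, 3, 1)
x_t = (x_buf, 0, 3, 2, 1, 3)
print(matmul_strided(x, x_t))  # [[14, 32], [32, 77]]
```

No transposed copy of `x` is ever materialized; the transpose exists only as a pair of swapped strides handed to the packer.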

7. Performance (beyond NpyIter and linalg)

Op	Improvement
Axis reductions, narrow ints	Widening SIMD (int16→int32 accum etc.): `sum(int16, axis=1)` 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug
`mean` (axis)	217× (Phase-0 bug surgery); `var`/`std` 21×; `count_nonzero` 20×
`np.nonzero`	IL SIMD kernel closes an 8–241× gap to NumPy
`np.where`	IL kernels for scalar-broadcast & non-contiguous (1.2–2× NumPy on broadcast conditions)
Strided 1-D unary	Fused strided-SIMD kernel: 0.55 ns/elem flat — beats NumPy at every size; strided `sqrt` reached parity via gather→tile→SIMD buffering
Strided flat reductions	Incremental-advance path: strided sum 8.3× faster (11.8× behind NumPy → 1.4×)
Comparisons	PDEP-based packed mask→bool store; broadcast/strided compares routed via NpyIter
Axis-0 reductions	Column-tiled accumulation (breaks the output RAW dependency); 8× pairwise unrolled flat reductions
Allocation	tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both the small-N ufunc result and 4M+ outputs that previously paid a fresh `VirtualAlloc` + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; `GC.SuppressFinalize` on free; `using`/ARC adopted across `concatenate`, `allclose`, `convolve`, `tile`, `eye`, masking, shuffle, …
Casts (SIMD campaign)	Strided/gathered SIMD kernels across the full 15×8×15 `astype` matrix — `cvtt` float→int, Giesen f16↔ widen/narrow, complex deinterleave, sub-word VPSHUFB shuffles, fused VPGATHER whole-array kernels, single-pass KEEPORDER same-type copy. Cliffs eliminated: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy
`np.zeros`	`calloc` / Windows `VirtualAlloc` demand-zero — O(1) regardless of size (10M f64: 14.3 ms → ~0.01 ms, was ~1000× slower)
Broadcast-reduce	Stride-0 axes folded algebraically in the flat-reduction chokepoint (no O(D×N) materialize) — `sum(broadcast_to(...))` now ~534–700× faster, beats NumPy, bit-exact
`sum`/`mean` (float)	Bit-exact NumPy pairwise summation ported onto the per-chunk reduce path — matches `np.add.reduce` bit-for-bit (unblocks float32)
`np.any`/`np.all` (bool/char)	Reinterpret to byte/ushort → existing integer SIMD path (was a 5–12× scalar cliff); fixes a latent AVX2 32-lane mask-overflow correctness bug
Complex/Half/Decimal reductions	NpyIter chunked `ForEach` axis reductions — Decimal 5–13×, Half mean 1.6–3.7×, Complex mean 15–45×→parity; float16 negate ~10× via sign-bit flip
Casts (`float→int32`)	NumPy-faithful SIMD `cvtt`, strided/reversed/gathered variants
`np.split` family	O(1) sub-shape derivation, direct views — 1.5–4× faster than NumPy
Where/copyto/searchsorted/unique	see §5
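
For readers unfamiliar with pairwise summation, the technique named in the `sum`/`mean` row, here is a generic Python sketch. The block size of 8 is an arbitrary choice for illustration and is not claimed to match NumPy's or NumSharp's splitting scheme; matching another implementation bit-for-bit requires reproducing its exact blocking and accumulation order.

```python
import math


def pairwise_sum(values, lo=0, hi=None, block=8):
    """Sum values[lo:hi] by splitting in half and adding the two partial sums."""
    if hi is None:
        hi = len(values)
    n = hi - lo
    if n <= block:
        total = 0.0
        for i in range(lo, hi):  # short runs are summed directly
            total += values[i]
        return total
    mid = lo + n // 2
    return pairwise_sum(values, lo, mid, block) + pairwise_sum(values, mid, hi, block)


data = [0.1] * 100_000
exact = math.fsum(data)  # correctly rounded reference
naive = 0.0
for v in data:
    naive += v
# error of the left-to-right loop vs error of the pairwise scheme
print(abs(naive - exact), abs(pairwise_sum(data) - exact))
```

Rounding error in the left-to-right loop grows roughly with the element count, while the pairwise scheme grows with its logarithm, so two implementations agree bit-for-bit only if they share the same scheme.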

8. Official benchmark suite + honest methodology

New cross-platform run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
Headline: geomean NumSharp÷NumPy = 1.00× at N=10M (166 ops faster / 171 close / 36 slower) — parity across the whole op surface at memory-bound sizes; ~1.9× at 1K where per-call dispatch dominates (tracked as the next focus).
Found and neutralized a benchmark-invalidating tooling bug: dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties, so hand loops measured ~2× slower while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
Canonical NpyIter benchmark — a section-addressable harness covering 33 op families × {scalar/1K/100K/1M/10M}, integrated into run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
Frontier findings — found, then fixed. Adversarial probes flagged real losses; the headline ones are now closed: np.sum over a broadcast_to view (was 54× slower) folds stride-0 axes algebraically and runs ~534–700× faster than NumPy, bit-exact; scalar np.any/np.all on bool/char (was 5–12× slower) reinterpret onto the integer SIMD path; np.zeros (was ~1000× slower) is calloc-backed. Remaining tracked items: small-N (~1K) per-call dispatch overhead and a few iterator edge cases pinned as [OpenBugs]/skipped repros. A win surfaced too: hand-rolled 8-band parallel iteration 4.7×.
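
The headline number is a geometric mean of per-op time ratios. The sketch below shows how such a figure and the faster/close/slower buckets could be computed. The function names are hypothetical, the 10% tolerance band is an assumption (the source does not state its threshold), and the sample ratios are made up rather than measured.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios; missing entries (None) are skipped."""
    logs = [math.log(r) for r in ratios if r is not None]
    if not logs:
        raise ValueError("no ratios to average")
    return math.exp(sum(logs) / len(logs))


def classify(ratio, tolerance=0.10):
    """Bucket one NumSharp / NumPy time ratio. The 10% band is an assumption."""
    if ratio < 1.0 - tolerance:
        return "faster"
    if ratio > 1.0 + tolerance:
        return "slower"
    return "close"


ratios = [0.5, 2.0, 1.0, None]  # made-up values, not measurements
print(geomean(ratios))          # about 1.0: a 2x win and a 2x loss cancel
print([classify(r) for r in ratios if r is not None])
```

An arithmetic mean of the same ratios would report about 1.17, overstating the loss, which is why ratio summaries are averaged in log space.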

9. Differential fuzzing vs NumPy (new infrastructure)

37,445 bit-exact corpus cases across 24 JSONL tiers generated from real NumPy 2.4.2 outputs: casts (full 15×15 matrix), binary arith (NEP50), div/mod/power, comparisons, unary (incl. float16 inputs + all narrow ints), reductions, NaN-aware reductions, cumulative, statistics, logic/extrema, bitwise+shift, where/place, manipulation, matmul, modf multi-output, sorting/searching, parameter sweeps, SIMD-tail boundaries (900 cases around vector-width edges), operand aliasing, and error-parity (exception-for-exception).
Seeded random fuzzer with an element-wise shrinker for minimal repros; metamorphic invariant tier (11 algebraic properties).
CI integration: FuzzMatrix gate wired into the build workflow + a new nightly fuzz-soak workflow (.github/workflows/fuzz-soak.yml).
Findings inventoried in docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).
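
A seeded fuzzer with an element-wise shrinker can be sketched as follows. This is a generic illustration, not the project's fuzzer: `buggy_sum` is a deliberately broken stand-in for an implementation under test, Python's built-in `sum` plays the reference oracle, and the shrinker is the simplest greedy variant.

```python
import random


def shrink(case, fails):
    """Greedily drop elements while the case still fails; returns a smaller failing case."""
    current = list(case)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate  # keep the smaller repro and rescan
                changed = True
                break
    return current


def buggy_sum(values):
    """Stand-in for an implementation under test: mishandles negative inputs."""
    return sum(abs(v) for v in values)


def fails(values):
    return buggy_sum(values) != sum(values)  # sum() plays the reference oracle


rng = random.Random(1234)  # seeded, so every run replays the same case
case = [rng.randint(-5, 5) for _ in range(20)]
if fails(case):
    print(shrink(case, fails))  # a single negative element
```

The seed makes a failure replayable, and the shrinker reduces a 20-element input to the one element that actually triggers the bug.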

10. Correctness — NumPy-parity bug fixes

Semantics (behavioral changes, may affect callers):

floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
Comparisons: <= / >= now return False for NaN (IEEE/NumPy).
Flat min/max propagate NaN.
np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool/np.negative(bool) now throw (NumPy behavior).
Transcendental ufuncs use NEP50 width-based float promotion.
np.power: negative integer exponent raises ValueError; exact integer-power semantics.
Cast semantics aligned with NumPy across all dtype pairs (IL kernels + ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
Broadcasting keeps rank when a 1-D [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.
Integer reciprocal(0) is per-width exact: int32/int64 → MinValue, uint64 → 2⁶³, but 0 for int8/int16/uint8/uint16/uint32 (was MinValue/0 across the board); bool → int8.
clip/maximum/minimum: float16 signed-zero scalar tail, NaN propagation through the SIMD kernel, and correct F-contiguous/strided element pairing.
float16 axis sum accumulates in float32 (NumPy parity); Complex flat min/max return the NaN-bearing element verbatim; Complex unary math ported from NumPy's own C99 algorithms.
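
Several of these semantic changes come down to small arithmetic rules. The Python sketch below contrasts floored division with the truncating division of C-family integer operators, and shows unsigned negation and NaN comparisons. It illustrates the target semantics only: the helper names are hypothetical, and `trunc_divmod` is intended for small integers because it goes through a float division.

```python
import math


def trunc_divmod(a, b):
    """Truncating division: quotient rounds toward zero, remainder takes the dividend's sign."""
    q = int(a / b)
    return q, a - q * b


def floor_divmod(a, b):
    """Floored division: quotient rounds down, remainder takes the divisor's sign."""
    return a // b, a % b


def negate_uint(value, bits):
    """Unsigned negation wraps modulo 2**bits instead of failing."""
    return (-value) % (1 << bits)


print(trunc_divmod(-7, 2))     # (-3, -1)
print(floor_divmod(-7, 2))     # (-4, 1)
print(negate_uint(1, 8))       # 255
nan = math.nan
print(nan <= nan, nan >= nan)  # False False
```

The two division styles agree whenever both operands have the same sign, so the difference only shows up on mixed-sign inputs.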

Crashes & corruption:

Overlapping-operand corruption eliminated iterator-wide (COPY_IF_OVERLAP, §1).
Masked iteration: a buffered WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.
uint32 axis-sum produced wrong values past 8 distinct columns (widening-SIMD rewrite).
np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
UnmanagedStorage/ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
TensorEngine now propagates through Cast/Transpose/copy/reshape/ravel (custom engines were silently dropped).
take with out= enforces NumPy's safe-cast direction; put/place non-contiguous writeback fixes; argsort on non-C-contiguous input.
NpyIter ForEach/ExecuteGeneric/ExecuteReducing read past the end without EXTERNAL_LOOP.
np.exp2 float32-output IL kernel was malformed (InvalidProgramException); np.power with a Half exponent threw InvalidCastException; a narrowing dtype= on a complex float-ufunc segfaulted — all fixed.
Complex nansum axis reduction read uninitialized memory for ndim ≥ 3; the AVX2 32-lane any() mask overflow (byte/sbyte) returned wrong results; net8.0 complex abs and axis min/max NaN propagation corrected.

11. Memory management — ARC + `IDisposable`

NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
Deterministic release means hot loops no longer wait on the finalizer queue; combined with the buffer pool this removes most steady-state GC pressure (dot at 100K: 446 collections → 0).
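
The lifecycle rules above (views keep the block alive, dispose is idempotent, the block is freed exactly once) can be modelled in a short Python sketch. It is an analogy rather than the NumSharp implementation: the class names are hypothetical, a `threading.Lock` stands in for the compare-and-swap loop, and the per-handle disposed flag is not itself made thread-safe here.

```python
import threading


class RefCountedBlock:
    """Owns a resource and frees it exactly once, when the last reference is released."""
    def __init__(self, on_free):
        self._count = 1                # the creator holds the first reference
        self._lock = threading.Lock()  # stands in for a compare-and-swap loop
        self._on_free = on_free

    def try_add_ref(self):
        with self._lock:
            if self._count == 0:
                return False           # already freed: cannot be resurrected
            self._count += 1
            return True

    def release(self):
        with self._lock:
            self._count -= 1
            last = self._count == 0
        if last:
            self._on_free()


class Handle:
    """An array or view; disposing twice is harmless."""
    def __init__(self, block):
        self._block = block
        self._disposed = False

    def view(self):
        if self._disposed or not self._block.try_add_ref():
            raise RuntimeError("storage already released")
        return Handle(self._block)

    def dispose(self):
        if not self._disposed:
            self._disposed = True
            self._block.release()


freed = []
parent = Handle(RefCountedBlock(lambda: freed.append("block")))
child = parent.view()
parent.dispose()
parent.dispose()  # idempotent; the view keeps the block alive
print(freed)      # []
child.dispose()
print(freed)      # ['block']
```

Refusing to add a reference once the count has reached zero is what prevents a late view from resurrecting freed memory.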

12. `Char8` primitive

New 1-byte character type (NumSharp.Char8) — the NumPy S1/Python bytes(1) equivalent — with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene

ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels — the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
All Vector128/256/512 and Math/MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
24 dead kernel methods removed outright (were [Obsolete(error: true)] tombstones, referenced by nothing); dead axis-reduction SIMD paths removed.

15. Documentation

NpyIter/NDIter book: docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
DocFX website — Benchmarks vs NumPy: benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
Engineering ledgers: PERF_LEDGER.md (every optimization with before/after), NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), MIGRATE_NPYITER.md, IL-kernel playbook, fuzz findings/coverage.
Branch quality audit findings are pinned as test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs — every Tier-1 finding fixed or reproduced as an [OpenBugs] test.

16. Tests & CI

+2,600 test methods; suite now 9,990 passed / 0 failed on net8.0 + net10.0. Zero regressions maintained commit-by-commit.
New suites: np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out=/where=/dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
CI: fuzz gate in build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
Known gaps stay visible: the still-unimplemented NumPy functions are flip/fliplr/flipud/rot90, diag, gradient, and round (np.sort is now done); small-N (~1K) per-call dispatch overhead is the headline performance focus (docs/NPYITER_GAPS_AND_ROADMAP.md); a few iterator edge cases remain pinned as [OpenBugs]/skipped repros. Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design test rather than ignored.

Breaking changes

Change	Impact	Migration
`bool - bool`, `-bool`, `np.negative(bool)` now throw	Matches NumPy	Use `^` / cast to int first
NaN `<=` / `>=` returns `False`	Matches IEEE & NumPy	Use `np.isnan` explicitly
`floor_divide`/`mod` divide-by-zero & floored results	Matches NumPy	—
`np.negative(uint)` wraps instead of throwing	Matches NumPy	—
`np.power(int, negative int)` raises `ValueError`	Matches NumPy	Use float exponents
Cast edge cases (overflow/NaN/complex→bool/float→int truncation)	Matches NumPy	—
Transcendental ufuncs: NEP50 width-based promotion	Return dtype may change	—
`np.clip`/quantile-family dtype promotion	Return dtype may change	—
Broadcast views are read-only; broadcasting keeps rank for 1-D `[1]`	Matches NumPy	`.copy()` to write
`MultiIterator` and `NDIterator` (+ `NDIterator<T>`, `AsIterator`) removed	Public types removed (threw at runtime anyway)	Use `NpyIter` / `NpyIter.Copy` / `NpyFlatIterator`
NpyIter: `MaxOperands=8` and 64-dim limits removed	None (loosening)	—
`np.copyto` unwriteable-destination error type corrected	Exception type change	—

Everything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.

Nucs · 2026-06-05T13:40:10Z

📊 Benchmark & performance — `nditer`

Two performance items on this branch: a fused strided-SIMD unary kernel, and an official NumSharp-vs-NumPy benchmark across nearly all op categories at three sizes (benchmarked Core state = d01f1d63).

1. Fused strided-SIMD unary IL kernel (`d01f1d63`)

New whole-array kernel for unary ops over non-contiguous 1-D inputs: strided gather → Vector{W}.Create → unary op → contiguous store, single pass, no scratch buffer, no per-tile dispatch.

Isolated kernel, np.sqrt(a[::2]), ns/elem:

N	fused	NumPy 2.4.2	NumSharp speedup
64	0.547	4.322	7.9×
4,096	0.549	0.660	1.20×
262,144	0.557	1.419	2.55×

The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.
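
The gather → vector → op → contiguous store shape of the kernel can be mimicked in Python to show the loop structure. This is a scalar emulation under stated assumptions: `LANES` is an illustrative width, `strided_sqrt` is a hypothetical name, and a list comprehension stands in for building the vector from strided loads.

```python
import math

LANES = 4  # illustrative vector width; real hardware widths differ


def strided_sqrt(buf, offset, count, stride):
    """sqrt over buf[offset], buf[offset + stride], ... written to a dense output."""
    out = [0.0] * count
    i = 0
    # Main loop: gather LANES strided elements, apply the op, store contiguously
    while i + LANES <= count:
        base = offset + i * stride
        lanes = [buf[base + k * stride] for k in range(LANES)]  # gather
        out[i:i + LANES] = [math.sqrt(v) for v in lanes]        # op + dense store
        i += LANES
    # Scalar tail for the elements that do not fill a whole vector
    while i < count:
        out[i] = math.sqrt(buf[offset + i * stride])
        i += 1
    return out


a = [float(n * n) for n in range(12)]  # 0, 1, 4, 9, ...
print(strided_sqrt(a, 0, 6, 2))        # sqrt of a[::2]
```

Because the gather feeds the operation directly, no scratch buffer sits between the strided read and the dense write.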

Ops on this path (11 in total; 10 are listed here), speedup vs NumPy @262K (f64):

abs        3.37×   negate  3.15×   floor   3.07×   trunc   3.03×   round  3.00×
sqrt       2.55×   rad2deg 2.41×   deg2rad 2.22×   square  2.18×   reciprocal 1.72×

Verified with 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.

Note: this is a DirectILKernelGenerator whole-array kernel that bypasses NpyIter by design — the fusion (gather folded into Vector.Create) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.

2. Official NumSharp-vs-NumPy benchmark (`6038990f`)

Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.

The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.

Headline — geomean (NumSharp ÷ NumPy, lower = better):

        slower ◄───────── 1.0 (parity) ─────────► faster
1K    ████████████████████  1.96×   (102 win / 212 lose)
100K  ██████████████████▎   1.83×   (109 win / 196 lose)
10M   ██████████▏ ........  1.00×   (166 win /  36 lose)   ◄ PARITY

At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-call dispatch + result-allocation tax (~2×).

Per-suite geomean by size:

suite	1K	100K	10M
Statistics	0.19×	0.68×	0.48× ✅
Sorting	0.41×	1.13×	0.45× ✅
Comparison	1.27×	2.22×	0.50× ✅
Bitwise	8.16×	1.16×	0.61× ✅
Reduction	0.48×	0.94×	0.91× ✅
Arithmetic	3.09×	2.62×	1.25× 🟡
Unary	3.50×	4.44×	1.53× 🟡
Creation	12.26×	2.92×	2.24× 🟠
LinearAlgebra	2.76×	1.66×	4.02× 🔴

🏆 Biggest wins (@10M, real ms):

op	dtype	NumPy	NumSharp	speedup
`average`	f32	9.60	0.94	10.2×
`nansum`	f32	14.35	1.49	10.0×
`nanprod`	f32	18.52	1.90	9.7×
`var`	f32	16.96	2.60	6.5×
`count_nonzero`	f64	22.61	3.74	6.0×
`nanmean`	f64	33.47	5.69	5.9×

🎯 Biggest gaps (@10M), optimization targets:

op	dtype	NumPy	NumSharp	gap
`sum axis=1`	uint8	3.12	49.74	16.0×
`dot`	f64	1.23	16.46	13.4×
`matmul`	f64	0.72	4.26	5.9×
`argsort`	int32	369	2162	5.9×

→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.

Per-dtype @10M (geomean):

int64 0.91  uint64 0.92  f32 0.93  f64 0.98  uint8 1.00  uint32 0.99   ◄ strong
int32 1.11  int16 1.14   uint16 1.24   bool 1.60                       ◄ weak (bool, narrow-uint)

Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.

Reproducibility

Reusable cross-platform runner: python benchmark/run_benchmark.py (builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives).
Full report: benchmark/benchmark-report.md (1,311 rows).
Provenance snapshot keyed by date+hash: benchmark/history/2026-06-05_6038990f/ (manifest + report + NumPy timings).

…tier; AV→NA; one CI

Folds the NDIter benchmark into the official orchestrator so there is ONE entry point and ONE report, while keeping the two harnesses distinct (they measure different things, op/dtype/N throughput vs the iterator machinery, and the NDIter harness needs internal access + section-isolation the BenchmarkDotNet in-process run can't give).

run_benchmark.py: after the official (op,dtype,N) merge, runs the NDIter sheet + cards and APPENDS the sheet to benchmark-report.md as its own section (not merged: different result model). Archives nditer_results.{md,tsv} + cards into results/<ts>/. New --skip-nditer flag. This is now the single command for the whole NumSharp-vs-NumPy comparison.

+10M tier (decision 1): nditer_bench.{cs,py} sweep now scalar/1K/100K/1M/10M (grid 2500x4000 = 10M exactly; pick 30 iters/3 rounds at 10M). sheet TIERS + cards pick it up automatically.

AV → NA/IGNORED (decision 3): instead of silently omitting a section that crashes all retries, the sheet now records its ids NA (NumPy runs first to give the expected id set), prints an AV-POLICY header explaining the known intermittent AccessViolation is ignored, lists 'THIS RUN: NA across <sections>', shows NA cells in the per-family/dividends matrices, and excludes NA from every geomean. tsv stores NA; load/cards skip it.

CI consolidation (decision 2): nditer-benchmark.yml -> benchmark.yml, now runs the ENTIRE suite via run_benchmark.py. Trigger changed from workflow_run-on-every-build to release:published (the real 'after a successful release' signal: 'Build and Release' publishes a GitHub Release on a v* tag) + workflow_dispatch, so the heavy full suite runs per-release, not per-push. Commits report + cards to master with [skip ci]. timeout-minutes: 180.

The npyiter_parity_poc gather kernels and the rest of the harness methodology (Release-only, matched kernels, positive-not-copyto, section isolation) are unchanged.
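The AV → NA policy described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in nditer_sheet.py: the helper names `fill_na` and `geomean_excluding_na` and the sample ids and timings are hypothetical. The NumPy run defines the expected id set, ids the NumSharp run never reported are recorded as NA, and NA cells are left out of the geomean.

```python
import math

NA = "NA"  # marker for ids whose section crashed on every retry

def fill_na(expected_ids, numsharp_ms):
    """Return {id: ms or NA}; ids missing from the NumSharp run become NA."""
    return {op_id: numsharp_ms.get(op_id, NA) for op_id in expected_ids}

def geomean_excluding_na(numpy_ms, numsharp_cells):
    """Geomean of NumPy / NumSharp speedups over ids that have real data."""
    logs = [
        math.log(numpy_ms[op_id] / ms)
        for op_id, ms in numsharp_cells.items()
        if ms != NA
    ]
    return math.exp(sum(logs) / len(logs)) if logs else None

# Hypothetical timings (ms); "where@1K" stands in for a crashed selection id.
numpy_ms = {"add@1K": 2.0, "sum@1K": 8.0, "where@1K": 3.0}
numsharp_ms = {"add@1K": 1.0, "sum@1K": 16.0}

cells = fill_na(numpy_ms.keys(), numsharp_ms)
na_ids = sorted(k for k, v in cells.items() if v == NA)
speedup = geomean_excluding_na(numpy_ms, cells)
```

Running the baseline first is what makes the NA cells visible: without the expected id set, a crashed section would simply be absent rather than reported.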

…n selection

Refreshes the canonical NDIter results (nditer_results.md/.tsv) and the two README cards with a full sweep that now includes the 10M cache tier, and records the AV->NA policy firing on a real run. Also documents the run_benchmark.py integration in benchmark/CLAUDE.md.

What changed
------------
* 198 measured pairs (was 162), 35 of them NA. The new 10M tier adds 36 ids across the size-swept families; SIZES is now scalar/1K/100K/1M/10M end to end (bench .cs + .py grids: 10M = 2500x4000).
* selection (where / a[mask] / a[mask]= / count_nz / argwhere / a[idx] / a[idx]=) hit NumSharp's known intermittent AccessViolation on EVERY retry this run, so the whole section is reported NA/IGNORED per policy and excluded from every geomean. The header now reads "198 measured pairs (35 NA)" and "AV POLICY ... THIS RUN: NA across selection."; the section renders as "(no data)" / "-" / "NA" cells instead of crashing the sweep. This is the designed crash-resilience path proven on a live run, not a regression.
* Headline operation matrix: 1.17x geomean, 77 win / 53 lose over 130 cells (26 non-selection families x 5 tiers). Reductions lead (1.80x), dtypes 1.59x, elementwise 1.12x; copy/cast (0.65x) and index-math (0.70x) remain the small-N laggards already tracked as canaries.

Doc
---
benchmark/CLAUDE.md run_benchmark.py section now describes the appended NDIter step (aspect x tier, appended-not-merged, section-isolated, AV->NA, --skip-nditer) and points at benchmark/nditer/README.md, so the dev guide matches the wired-in integration (run_benchmark.py + benchmark.yml).

Known bug surfaced (tracked, not fixed here)
--------------------------------------------
The selection-section AccessViolation (0xC0000005) is an unmanaged-storage lifetime bug in NumSharp under heavy mixed alloc/free load. It is intermittent (~50% per heavy section) and uncatchable; the benchmark now degrades to NA rather than masking it. Worth a dedicated issue + fix pass.

…ted report artifacts

Adds docs/website-src/docs/benchmarks.md, the DocFX page the user asked for: "the real place where we discuss and present the efforts to surpass NumPy through the power of Runtime IL Generation." It is the evidence companion to the existing IL Generation page (il-generation.md explains HOW the kernels are emitted; this page shows WHAT that buys head-to-head against NumPy).

The page is driven by the artifacts the Benchmark workflow (benchmark.yml) auto-commits to master after every release:

* The two 400x300 cards are embedded by absolute raw.githubusercontent master URLs (same source the README uses), so they always reflect the latest committed run rather than a pasted screenshot. Verified the docfx build keeps the URLs absolute (it does not relativize external links).
* The full reports are linked on master: the iterator sheet (benchmark/nditer/nditer_results.md, which the cards render from) and the op/dtype/N matrix (benchmark/benchmark-report.md), plus the harness README and benchmark/CLAUDE.md.

Content (grounded in the current committed nditer_results.md numbers):

* Headline cards + a by-class geomean table (reductions ~1.8x, dtypes ~1.6x, elementwise ~1.1x parity, copy/cast ~0.65x, index-math ~0.7x).
* Class-by-class discussion tying each result to the IL mechanism (4x unrolling, tree reduction, SIMD early-exit, per-(op,dtype,layout) specialization), and honest about the taxes (small-N copy/cast, all-false any() scan, bcast_reduce).
* The dividends NumPy can't structurally match: expression fusion (np.evaluate, up to ~13x), kernel reuse, parallel inner loop (par8 up to ~8x), cheaper iterator construction (~2-3x vs np.nditer).
* Methodology + honesty section: Release-only JIT, best-of-rounds, ratios-not-absolutes, and the AV->NA policy.
* Reproduce-locally commands.

Wiring:

* docs/toc.yml: new "Benchmarks vs NumPy" entry right after IL Generation.
* il-generation.md: cross-link from the Performance Impact section ("naive C#" table vs the head-to-head-NumPy page).
* index.md: added IL Generation + Benchmarks links to Get Started.

Validated with `docfx build` (build-only, metadata skipped): 0 errors, the page itself emits 0 warnings (the 84 UidNotFound warnings are api/toc.yml entries that only resolve after the metadata step, which CI runs first). benchmarks.html renders, cards resolve to absolute URLs, internal links rewrite to .html.

Note: deploy is via docs.yml on push to master (paths: docs/website-src/**); this branch commit does not deploy until merged. How the page REFERENCES the auto-committed cards (raw-master URL vs bundling copies into website-src/images/) is the next thing to settle.

…FX site

Two follow-ups to the Benchmarks vs NumPy page, both from user direction.

1) The two 400x300 cards now carry the whole canonical summary (modeled on the ASCII sheet the user singled out), not just one bar chart each. Everything is still COMPUTED from nditer_results.tsv, so the cards auto-update each run and NA (AccessViolation) ids are skipped.

* cards/ops.png: OPERATIONS vs NumPy: headline (geomean / win-lose / cells) + by-array-size-tier bars (scalar..10M) + by-operation-class bars ranked best->worst (reductions 1.80x ... copy/cast 0.65x; wins green, the two small-N taxes red).
* cards/cat.png: the IL-GENERATION DIVIDENDS, the "machinery NumPy has no equivalent for": iterator build vs np.nditer, expression fusion (np.evaluate), kernel reuse, parallel inner loop (each bar is the honest geomean with an "up to <peak>x" annotation), plus the chunk-width trend (w=4 -> w=1024) and the honest pathology canary (bcast_reduce ~52x behind, in red).

nditer_cards.py rewritten: shared hbars() helper, color_of() (green/amber-parity/red), stat() for (geomean, peak), two card builders. Imports CTOR/CW/PATH/DIVIDENDS from the sheet so the section data stays single-sourced. Captions/alt-text updated to match the new card semantics (cat.png is no longer "by op class") in README.md and benchmarks.md.

2) Full reports are now rendered INTO the site as searchable pages (user choice: "Render into the site"), in addition to being linked on GitHub:

* docs/website-src/docs/benchmark-matrix.md: the op/dtype/N matrix (benchmark-report.md body under a single page H1).
* docs/website-src/docs/benchmark-iterator.md: the canonical iterator sheet (nditer_results.md fenced block under a page H1).
* toc.yml nests both under "Benchmarks vs NumPy"; benchmarks.md "Read the full reports" now links the on-site pages (raw files still linked on master).

benchmark.yml regenerates these two pages from the just-produced reports (op matrix drops its own H1 via tail -n +2 so the page has one title; the iterator sheet has no H1), commits them alongside the report + cards, and, because the commit carries [skip ci] and the pages live under docs/website-src/**, then runs `gh workflow run docs.yml` to redeploy the site (added actions:write + GH_TOKEN).

Validation
----------
* nditer_cards.py renders both cards; verified visually (legible at 400x300).
* benchmark.yml is valid YAML (yaml.safe_load).
* docfx build (build-only): 0 errors; benchmark-matrix.html + benchmark-iterator.html generate; benchmarks.html internal links to both resolve; no warning names any new page (the 82 UidNotFound warnings are api/toc.yml, resolved by the metadata step CI runs first). No docs/website/ build-output committed.

Still open (deferred by the user): the card REFERENCING mechanism on the docs page (raw-master URLs today vs bundling the PNGs into website-src/images/). The redeploy chaining added here would make that swap trivial if chosen later.

… 15 Best"

The op/dtype/N matrix report (benchmark-report.md, rendered into the site as benchmark-matrix.md) showcased garbage: every "Top 15 Best" row was np.copy(float64) and np.searchsorted at "0.0 / 0.0x". Three distinct bugs, all fixed.

BUG 1: searchsorted benchmark measured nothing (both sides)

SortingBenchmarks.cs and numpy_benchmark.py issued a SINGLE scalar lookup (np.searchsorted(sorted, N/2)): one O(log N) binary search, ~18ns at EVERY N, pure call overhead. Against NumPy's ~1µs Python overhead that manufactured a meaningless 50–1000x "win". Fixed: both now query the N-element array (a) into the sorted target → N binary searches, real work that scales with N. (Verified the C# benchmark project still compiles.)

BUG 2: normalize_op_name collapsed a slice-copy onto np.copy

The Slicing suite's "np.copy(a[100:1000])" (a fixed 900-element slice copy, ~3.6µs at every N) was normalized by stripping ALL "[...]", including the array-index "[100:1000]", yielding "np.copy", which COLLIDED with the Creation full-array "np.copy(a)" in csharp_index (last-write-wins) and overwrote the real float64 measurement. THAT was the bogus "copy float64 = 0.0036ms" (not a copy bug at all; the op is fine: archived raw float64 copy@10M = 11.04ms). Fixed: only strip a space-separated " [annotation]" (\s+\[ instead of \s*\[), never index brackets attached to an identifier. Incidentally also de-collides concatenate/stack/slice variants. copy(float64) now reads its real values across all sizes (10M → 11.04ms, ratio 0.60 = a genuine win).

BUG 3: the report ranked/averaged non-credible rows as wins

merge-results.py sorted "Top Best" by ratio with only a `ratio is not None` guard, so a sub-resolution NumSharp time (ratio rounding to 0.0) sorted to #1, and CSV blanked legit 0.0 via `r.ratio or ''`. Fixed with a credibility gate (classify()): a row is "negligible" (new ▫ status) when either side did <1µs of work OR the speedup exceeds 20x (NumSharp >20x faster ⇒ artifact: a view, a lazy alloc, or a dead-code-eliminated kernel). Negligible rows are EXCLUDED from Top Best/Worst and from the per-size geomean, but still listed (▫) in the per-suite tables; nothing is hidden. Also: store ms at 4 / ratio at 3 decimals, show 3-decimal ms + 2-decimal ratio in the showcase (no more "0.0/0.0x"), fix the `or ''` falsy-zero in CSV, add the ▫ legend row + summary/size-table counts, and a header note stating how many rows were excluded and why.

Result (regenerated from the on-disk run archive with the fixed merge):

* Top Best is now real reductions/statistics wins (np.nansum 0.08x, np.percentile 0.10x, np.average 0.10x), with genuine ms on both sides.
* 1233 ops → 305 faster / 255 close / 169 slower / 103 much-slower / 275 NEGLIGIBLE (the artifacts, previously ~all counted as "faster") / 126 n/a.
* Top Worst surfaces a real gap: np.zeros (NumSharp eagerly zeros ~10.7ms vs NumPy lazy calloc ~0.01ms), a legitimate optimization target, not an artifact.

benchmark-matrix.md (the DocFX page) re-seeded from the corrected report; docfx build clean (0 errors). The searchsorted benchmark fix takes effect on the next CI run; the credibility gate keeps any residual artifact out of the showcase meanwhile.
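The credibility gate lends itself to a short sketch. What follows is an assumption-laden illustration, not the body of classify() in merge-results.py: the constant names, the row dictionaries and the `top_best` helper are hypothetical, and the timings are rounded figures quoted in this changelog. A row counts as negligible when either side did under 1 µs of work or when NumSharp appears more than 20× faster; only credible rows are ranked.

```python
MIN_WORK_MS = 0.001   # 1 µs expressed in milliseconds
MAX_SPEEDUP = 20.0    # beyond this, a NumSharp "win" is treated as an artifact

def classify(numpy_ms, numsharp_ms):
    """Return 'negligible' for non-credible rows, else 'credible'."""
    if numpy_ms < MIN_WORK_MS or numsharp_ms < MIN_WORK_MS:
        return "negligible"          # sub-microsecond: pure call overhead
    if numpy_ms / numsharp_ms > MAX_SPEEDUP:
        return "negligible"          # implausible win: view, lazy alloc, dead code
    return "credible"

def top_best(rows, n=15):
    """Rank credible rows by NumSharp / NumPy ratio, lowest (fastest) first."""
    credible = [
        r for r in rows
        if classify(r["numpy_ms"], r["numsharp_ms"]) == "credible"
    ]
    return sorted(credible, key=lambda r: r["numsharp_ms"] / r["numpy_ms"])[:n]

rows = [
    {"op": "searchsorted", "numpy_ms": 0.0010, "numsharp_ms": 0.000018},  # ~18 ns lookup
    {"op": "nansum", "numpy_ms": 14.35, "numsharp_ms": 1.49},
    {"op": "zeros", "numpy_ms": 0.01, "numsharp_ms": 10.7},
]
best = [r["op"] for r in top_best(rows)]
```

Note that the gate is one-sided: a large slowdown such as np.zeros stays credible, because only implausible wins are treated as artifacts.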

… 1.3–6.1)

Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile: the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.

Learned and folded in:

- np.evaluate: Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2): one NumPy-shaped overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression.
- Canonical NDIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win).

Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35.

Same content (minus H1) pushed live to the PR #611 description via REST PATCH.

…oard page

Adds a new DocFX page in the nditer_results.md dashboard style (ASCII bars, geomeans, win/lose, top wins/losses) applied to the broad op × dtype × N matrix: the graph/stats/numbers companion to the narrative benchmarks.md, with minimal prose.

* benchmark/scripts/render_dashboard.py: reads the merged benchmark-report.json and emits benchmark-dashboard.md: headline geomean, BY-SIZE-TIER / BY-SUITE / BY-DTYPE bars (same bar() aesthetic as nditer_sheet.py: length 10 = parity, 20 = 2.0×), the status mix, and TOP-12 wins/losses with raw ms. Charts only CREDIBLE rows (the merge-results.py gate), so the negligible artifacts that used to dominate stay out. speedup = NumPy ÷ NumSharp.
* docs/website-src/docs/benchmarks-dashboard.md: the page (title + one-line note + the ```-fenced sheet), seeded from the renderer. Nested under "Benchmarks vs NumPy" in toc.yml as "Dashboard (op matrix)", beside the full Operation matrix and Iterator sheet.
* benchmark/.gitignore: ignore the benchmark-dashboard.md intermediate (the tracked form is the DocFX page), matching how benchmark-report.json/csv are handled.

What it shows on the current data (honest, broad picture vs the curated nditer sheet): 0.74× geomean over 832 credible cells (305 win / 527 lose). NumSharp trails on the full matrix but reaches parity at 10M (0.98×), and wins decisively where its IL kernels shine: statistics 2.28×, broadcasting 1.22×, reduction 1.21×; uint8 1.07×. Laggards are arithmetic/unary/creation and bool. Top wins: nansum/percentile/average (8–13×). Top losses: np.zeros (eager-zero vs NumPy lazy calloc, ~500–880×) and argsort (~25×).

Prototype scope: the page is a committed STATIC snapshot. To make it live (auto-refresh each release like the matrix/iterator pages), wire render_dashboard.py + a seed step into run_benchmark.py / benchmark.yml; deferred pending design review. docfx build is clean.
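The "length 10 = parity, 20 = 2.0×" bar convention can be pictured with a small sketch. This is a guess at the shape of such a helper, not the bar() in nditer_sheet.py: the clamping at 20 cells, the padding character and the parameter names are all assumptions made for illustration.

```python
def bar(speedup, parity_len=10, max_len=20, fill="█", pad="."):
    """Render a speedup as a fixed-width bar where parity_len cells = 1.0x."""
    cells = round(speedup * parity_len)
    cells = max(0, min(max_len, cells))   # clamp so very large wins do not overflow
    return fill * cells + pad * (max_len - cells)

# Hypothetical rendering of three figures quoted above.
for label, value in [("statistics", 2.28), ("parity", 1.0), ("headline", 0.74)]:
    print(f"{label:<11}{bar(value)}  {value:.2f}×")
```

A fixed total width keeps the numeric column aligned regardless of how long the filled part is.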

Two net8.0-only BCL semantic gaps surfaced by the fuzz differential matrix. Both behave correctly on net9.0+ (where the BCL was fixed) but produced wrong values on net8.0; worked around to match NumPy 2.4.2.

1. np.abs(complex) with an infinite component returned NaN instead of +inf
------------------------------------------------------------------------
cabs(NaN + inf*i) must be +inf (C99 hypot / npy_cabs: the infinity test precedes the NaN test). System.Numerics.Complex.Abs routes through a private Hypot whose operand ordering is NaN-unaware, so on net8.0 it returns NaN for abs(NaN+inf*i) (fixed in the .NET 9 BCL).

Added Utilities/NDComplexMath.Abs(Complex): returns +inf when either component is infinite, else defers to Complex.Abs, so every finite/NaN-only magnitude that already matched NumPy bit-for-bit is unchanged. Repointed the two cached MethodInfo handles that drive every complex-abs emit site: DirectILKernelGenerator.CachedMethods.ComplexAbs (6 IL call sites across the scalar/strided/predicate/math/decimal unary loops) and DefaultEngine.UnaryOp.s_complexAbs (NDIter Tier-3B route).

Fixes 19 unary.jsonl + 1 random_smoke.jsonl fuzz cases (all layouts: contiguous / strided / transposed / broadcast / negstride).

2. ptp / amax / amin along an axis dropped NaN instead of propagating it
------------------------------------------------------------------------
The typed-struct leading/innermost axis-reduction fast paths (MinOp<T>/MaxOp<T>.Combine256/128) called raw Vector256/128.Min/Max. The x86 vminps/vmaxps these lower to return the SECOND operand on an unordered (NaN) compare; the BCL Vector{N}.Min/Max only adopted IEEE NaN propagation in .NET 9. Verified: Vector128.Max(NaN,5) == 5 on net8.0, == NaN on net10.0. So max/min/ptp over a NaN-laced axis silently lost the NaN on net8.0 (ptp axis=0 returned a finite value where NumPy = NaN).

Routed MinOp/MaxOp through the existing NaNAwareMinMax256/128 helper (already used by the contiguous/strided CombineVectors paths) and wrapped that helper's float/double self-equality mask in #if NET8_0, so net9.0+ keeps the single-instruction vmaxps with zero overhead while net8.0 gets ConditionalSelect(ordered, min/max, a+b) NaN propagation. The flat whole-array reduction kernel already emitted this via EmitVectorNaNPropagatingMinMax, so only the axis fast paths were affected.

Fixes 12 stat.jsonl fuzz cases (ptp float32/float64, axis 0/1, C/F-contig).

Verification: full unit suite green on BOTH net8.0 and net10.0 (9709 passed / 0 failed under the CI filter), FuzzMatrix 42/42 on both. The originally reported trunc "Could not find Truncate for Vector128" failures were already resolved in-tree by the CanUseUnarySimd #if NET8_0 guard (commit 5716f86); the leak-guard working-set tests pass locally (their CI failures were OS working-set / GC-mode noise, not a managed or unmanaged leak).
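Both workarounds reduce to small scalar rules, sketched here in Python purely for illustration. The function names are hypothetical and the per-lane loop stands in for the vectorized C# code; this is not a port of NDComplexMath or NaNAwareMinMax. The first function applies the "infinity test precedes the NaN test" rule, the second mirrors the ConditionalSelect(ordered, max, a+b) shape.

```python
import math

def complex_abs(z):
    """|z| with the C99 rule: an infinite component wins over a NaN one."""
    if math.isinf(z.real) or math.isinf(z.imag):
        return math.inf
    return math.hypot(z.real, z.imag)

def nan_propagating_max(a, b):
    """Per-lane max that yields NaN if either input lane is NaN.

    a + b is NaN whenever either operand is NaN, so it serves as the
    result for unordered lanes; ordered lanes take the plain max.
    """
    out = []
    for x, y in zip(a, b):
        ordered = (x == x) and (y == y)   # self-equality is False only for NaN
        out.append(max(x, y) if ordered else x + y)
    return out

nan = float("nan")
magnitude = complex_abs(complex(nan, math.inf))
lanes = nan_propagating_max([nan, 1.0, 7.0], [5.0, 2.0, nan])
```

Using a + b for the unordered lanes avoids materializing a NaN constant: the addition produces one exactly when it is needed.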

…NumSharp faster) The dashboard prototype was the odd one out: I rendered it speedup = NumPy ÷ NumSharp (>1× = faster), while the op-matrix report it is derived from — and merge-results.py — use ratio = NumSharp ÷ NumPy (<1× = faster, lower is better). Two pages off the same data with opposite conventions is exactly the faster/slower confusion to avoid. Verified first that the underlying direction is NOT a flip: counting raw milliseconds (numsharp_ms vs numpy_ms, no ratio involved), NumSharp took LESS time on 305 ops and MORE time on 526 of 832 credible ops; geomean NS/NP = 1.36. So "NumSharp trails on the broad matrix" is real (concentrated in Arithmetic = 231 slower ops, and Unary), and it matches the op-matrix report's own conclusion. The dashboard's data was right; only its convention was inverted relative to the house default. render_dashboard.py now uses NS/NP throughout: * ratio = numsharp_ms / numpy_ms; header + axis read "faster ◄ 1.0 (parity) ► slower". * HEADLINE 1.36× geomean · 305 faster / 527 slower. * by-suite / by-dtype ranked fastest→slowest (ascending ratio): statistics 0.44×, reduction 0.83×, broadcasting 0.82× now read as FASTER; creation 2.83× / unary 2.63× / bool 3.55× as slower. * status bands relabelled to NS/NP (faster ≤1.0× / close 1–2× / slower 2–5× / much >5×). * tables renamed FASTEST / SLOWEST; each row shows the NS/NP ratio plus a human factor ("0.079× (12.6× faster)", "880.9× (881× slower)") so the small-ratio-is-good direction is unambiguous. benchmarks-dashboard.md re-seeded with the matching note; docfx build clean. This makes the report + dashboard consistent. The narrative benchmarks.md, the nditer iterator sheet, and the README cards still use the speedup (NP/NS, >1× = faster) framing — flipping those is a separate call (they are win-showcases where >1× reads naturally).

…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NDExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NDIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.

… stat Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp faster — bars grow right = faster, the original visual), AND every row now also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP. render_dashboard.py: * r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time). * headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g. statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum 12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP. * status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower 200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column narrow (NN% under 1000%, else NN×NP). benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550 ×54). Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py) still uses NS/NP "lower is better", and the nditer sheet / cards use NP/NS without the %NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step, pending confirmation.

Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%". Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was "lower is better"): * ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV). * get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20). * Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×). * Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled. nditer sheet (nditer_sheet.py) + cards (nditer_cards.py) — already NP/NS, ADDED 🕐 %NumPy: * sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time"; re-rendered nditer_results.md (--render-only, AV block intact). * cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered. Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention block, a %NumPy column to the by-class table, and a caption sentence. DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated report + sheet; benchmarks.md updated; docfx build clean (0 errors). 
The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af), so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README — is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
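The unified convention comes down to two reciprocal quantities plus a banding rule, sketched below. The function bodies are an assumption-level illustration, not the code in merge-results.py; in particular, which side of each band boundary is inclusive is a guess.

```python
def speedup(numpy_ms, numsharp_ms):
    """NumPy ÷ NumSharp: above 1.0 means NumSharp is faster."""
    return numpy_ms / numsharp_ms

def pct_numpy(numpy_ms, numsharp_ms):
    """Share of NumPy's time that NumSharp uses, as a percentage."""
    return numsharp_ms / numpy_ms * 100.0

def get_status(ratio):
    """Bands from the text: faster >=1.0 / close 0.5-1.0 / slower 0.2-0.5 / much <0.2."""
    if ratio >= 1.0:
        return "faster"
    if ratio >= 0.5:
        return "close"
    if ratio >= 0.2:
        return "slower"
    return "much slower"

# Hypothetical timings chosen to reproduce the "12.66× faster, 8%" reading.
np_ms, ns_ms = 12.66, 1.0
summary = f"{speedup(np_ms, ns_ms):.2f}× · {round(pct_numpy(np_ms, ns_ms))}% · {get_status(speedup(np_ms, ns_ms))}"
```

Because the two figures are reciprocals scaled by 100, their product is always 100, which gives a cheap consistency check between surfaces.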

The clock sat before the figure with the right-align padding landing between them ("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding now sits before the number (where it belongs) instead of after the emoji. * render_dashboard.py / nditer_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated. * merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column, Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated. * benchmarks.md + README: convention line / table column / caption "%NumPy🕐". * DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐". docfx build clean. The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu has no clock glyph — so there was never a gap to fix there).

… form pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads "87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of NumPy's time like every other cell. The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded, docfx build clean.
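A formatter with the behavior described here could look like the sketch below. It borrows the name pct_str from the text, but the body, the `width` parameter and its default are assumptions: the value is always rendered as a whole percentage, padding goes before the number, and the clock follows the % sign with no space.

```python
CLOCK = "\U0001F550"  # the 🕐 glyph (U+1F550)

def pct_str(pct, width=6):
    """Right-align 'NN%' to `width` columns, then append the clock directly."""
    text = f"{round(pct)}%"
    return text.rjust(width) + CLOCK

small = pct_str(87.2)      # padded: spaces land before the number
huge = pct_str(87957.0)    # large slowdowns stay percentages, no multiplier form
```

Keeping a single unit in the column means a reader never has to check whether a cell is a percentage or a multiplier.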

…g from the report The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K) read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs 87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between the two committed surfaces. Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the report. The report is correct (true ratio of the measured means); the dashboard was a rounding artifact of its own recompute. Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON (the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans key off the same stored ratios the report's Summary-by-size uses. Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page; header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%, quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
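The drift is easy to reproduce in isolation. The sketch below uses hypothetical timings and simplified row handling (it is not merge-results.py or render_dashboard.py, and it leaves the stored ratio unrounded for brevity): the ratio computed from full-precision means and the ratio recomputed from 4-decimal stored milliseconds disagree at two decimals.

```python
def merge_row(numpy_mean_ms, numsharp_mean_ms):
    """Compute ratio from full-precision means, then store rounded ms."""
    ratio = numpy_mean_ms / numsharp_mean_ms
    return {
        "numpy_ms": round(numpy_mean_ms, 4),
        "numsharp_ms": round(numsharp_mean_ms, 4),
        "ratio": ratio,
        "pct_numpy": 100.0 / ratio,
    }

def dashboard_values(row):
    """Consume the stored fields; fall back to 100/ratio only if pct is absent."""
    ratio = row["ratio"]
    pct = row.get("pct_numpy")
    if pct is None:
        pct = 100.0 / ratio
    return ratio, pct

row = merge_row(0.1257, 0.00994)          # hypothetical means, in ms
stored, pct = dashboard_values(row)
recomputed = row["numpy_ms"] / row["numsharp_ms"]   # the old, drifting path

print(f"stored {stored:.2f}×  vs  recomputed {recomputed:.2f}×")
```

The smaller the timing, the larger the relative effect of rounding it to four decimals, so fast cells are the ones most likely to drift.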

…not run" cells

normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark not run" for ops that WERE run. Three archive-safe alias passes (applied identically to both sides, so they only ever merge a true pair):

* empty "()": a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
* "->" spacing: C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
* np.around: IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy emits np.round, so the whole np.round family was ⚪ despite real data

Effect (re-merged from the same archive, no re-run): ⚪ no-data 126 → 116; the np.round family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape 2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions (no previously-matched cell lost), 0 new key collisions.

Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells, 0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved byte-for-byte). The dashboard stays cell-consistent with the report via the canonical ratio/pct fix from the prior commit.

NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean on uint*/int16 lack a counterpart on one side.
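The name-normalization rules mentioned in this changelog (strip only a space-separated annotation, drop empty call parentheses, tighten "->", alias np.around to np.round) can be sketched as four substitutions. The regular expressions and their order are assumptions written for illustration; the real normalize_op_name may differ, and the "np.add [float64]" label is a made-up example of an annotation.

```python
import re

def normalize_op_name(name):
    """Map cosmetically different benchmark labels onto one join key."""
    name = re.sub(r"\s+\[[^\]]*\]", "", name)    # drop " [annotation]", keep a[100:1000]
    name = name.replace("()", "")                 # "a.flatten()" -> "a.flatten"
    name = re.sub(r"\s*->\s*", "->", name)        # "2D -> 1D" -> "2D->1D"
    name = re.sub(r"\bnp\.around\b", "np.round", name)
    return name.strip()

labels = [
    "a.flatten()",
    "reshape 2D -> 1D",
    "np.around(a)",
    "np.copy(a[100:1000])",
    "np.add [float64]",
]
keys = [normalize_op_name(label) for label in labels]
```

Requiring whitespace before the bracket (`\s+\[`) is what keeps the slice copy distinct from the full-array copy, since index brackets sit directly against their identifier.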

…lex (NumPy parity)

These six complex ufuncs previously threw NotSupportedException from the EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires them up with full NumPy 2.4.2 parity.

Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2 pattern): a bit-exact probe over a finite battery showed System.Numerics.Complex matches NumPy to a few ULP on the finite interior, but diverges at 86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs instead of the C99 Annex G values, drops the sign of zero on branch cuts, and mishandles arctan's imaginary-axis cut. So:

- NDComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite interior to the BCL and add the C99 fixups:
  * Non-finite inputs: special-value tables ported from NumPy's msun npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
  * Branch-cut/signed-zero fixups (empirically derived against NumPy and verified on a 64-point signed-zero grid): asin negates Re on x=-0 and Im on y=-0; acos negates Im on the y=+0 cut; atan sets Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
  * Where this NumPy build's system libm diverges from msun at infinities (sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y) imaginary part, tanh's sign(y) zero, and the genuinely-unspecified zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,Asin,Acos,Atan} (pointing at NDComplexMath, not Complex.* directly) and add the six cases to EmitUnaryComplexOperation.

Verification: a bit-exact harness over a 117-point battery (finite + signed zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy 2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within <=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate than NumPy's log1p-based one; an accepted, documented divergence.

Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held to a tight 4-ULP gate (a real regression fails) instead of the blanket complex-unary excuse; arctan stays under the documented blanket for its accepted BCL-interior divergence.

All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus cases for the five tightly-gated ops are all within 4 ULP.
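For readers unfamiliar with ULP gates, the sketch below shows one common way to measure the distance between two doubles in units in the last place and to bucket a component as exact, signed-zero, within tolerance, or diverging. The function names and bucket labels are hypothetical; the commit does not show how its harness is implemented.

```python
import math
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer that grows monotonically with its value."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are sign-magnitude; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (finite inputs)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def classify(expected: float, actual: float, max_ulp: int = 3) -> str:
    """Bucket one real or imaginary component (illustrative labels)."""
    if math.isnan(expected) or math.isnan(actual):
        both_nan = math.isnan(expected) and math.isnan(actual)
        return "exact" if both_nan else "diverges"
    if struct.pack("<d", expected) == struct.pack("<d", actual):
        return "exact"
    if math.isinf(expected) or math.isinf(actual):
        return "diverges"  # an infinity only ever matches itself
    distance = ulp_distance(expected, actual)
    if distance == 0:
        return "signed-zero"  # +0.0 versus -0.0
    return "within-tolerance" if distance <= max_ulp else "diverges"
```

A complex result would be checked by classifying its real and imaginary parts separately, which matches the per-component counts quoted above.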

…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NpyIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NpyIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).

… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.

…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.

…ndentals

Adds AggressiveInlining/AggressiveOptimization to the complex hyperbolic and inverse-trig helpers and restructures them into a hot/cold split, so the JIT folds the per-element math into the IL-emitted unary kernel without a call frame:

- Sinh/Cosh/Tanh/Asin/Acos/Atan (+ Abs and the tiny IsNegZero/IsPosZero/HypotInf/ClogLarge helpers) are marked AggressiveInlining. Each public op is now a tiny finite-path wrapper (finite check -> Complex.* + fixups, or a cold-helper call) so it fits the inliner's budget.
- The non-finite C99 special-value tables move into cold helpers (SinhSpecial/CoshSpecial/TanhSpecial/CasinhNonFinite/CacosNonFinite/CatanhNonFinite) marked AggressiveOptimization -- kept out-of-line (so the hot wrapper stays inlineable) and fully optimized when actually hit.

Behavior is identical to the prior inline form (verified below).

IL-inlining experiment (the "emit the formula instead of call" question): benchmarked complex sinh both ways over 4M finite elements, median of 15 reps. The real-decomposition formula (Math.Sinh(x)*Math.Cos(y), Math.Cosh(x)*Math.Sin(y)) is bit-identical to Complex.Sinh (0/4M mismatches) but only 1.15x faster than the call; cosh 1.06x; asin/acos/atan have no real-Math.* formula (dominated by complex log/sqrt) so inlining would only drop a wrapper frame. The per-element cost is dominated by the transcendental itself, so emitting ~6 hand-written IL formulas is not worth the duplication/risk -- especially as the call-based kernel is already ~1.56x faster than NumPy 2.4.2 (np.sinh: 26.1 ns/elem vs NumPy 40.9).

Decision: keep the handwritten methods; the inlining attributes capture the (small, safe) wrapper-elimination gain.

Verified: NewDtypesUnaryTests + Fuzz UnaryExtra (4-ULP complex gate) green (62/62); the hot/cold split changes no results.
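The real-decomposition formula from the experiment can be written out as a small Python sketch. It uses Python's cmath as the reference instead of System.Numerics.Complex, so it only illustrates the identity for finite inputs; it says nothing about the bit-identity or timing measured in .NET.

```python
import cmath
import math


def sinh_decomposed(z: complex) -> complex:
    """sinh(x + iy) = sinh(x)*cos(y) + i*cosh(x)*sin(y), for finite x and y."""
    x, y = z.real, z.imag
    return complex(math.sinh(x) * math.cos(y), math.cosh(x) * math.sin(y))


def max_relative_error(samples) -> float:
    """Largest relative deviation from cmath.sinh over the given samples."""
    worst = 0.0
    for z in samples:
        reference = cmath.sinh(z)
        worst = max(worst, abs(sinh_decomposed(z) - reference) / abs(reference))
    return worst
```

Calling max_relative_error on a handful of moderate inputs, for example [0.5+1.25j, -2+3j], gives a value close to machine epsilon, which is the Python analogue of the agreement reported above.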

…exponential The np.allclose / np.random.exponential working-set leak guards (np.allclose.UsingTests, np.random.exponential.UsingTests) failed in CI with hundreds of MB of working-set growth (e.g. 551 MB on Linux, threshold 20 MB), while passing on Windows. Root cause: both functions allocate several NDArray intermediates per call and never dispose them — the unmanaged buffers ride the finalizer queue instead of being released synchronously. In a tight loop the managed wrappers are tiny so the GC rarely runs, leaving the intermediates LIVE between collections; the allocator can't reuse that memory, so the high-water mark balloons. On glibc (Linux) freed pages are retained in the arena, so the process RSS stays high even after the test's final GC.Collect()+WaitForPendingFinalizers() — hence the large WorkingSet64 delta. np.isclose: |a-b| <= atol + rtol*|b| materialized ~5 float64 temps (≈400 KB each at 50K elements) plus several bool temps, none disposed. Wrapped every fresh allocation in `using` (the elementwise operators/ufuncs each return a new array). x/y come from astype(copy:false), which returns the input itself when no conversion is needed, so they are caller-owned and never disposed here. The final combined array is captured in `using` too: MakeGeneric<bool>() takes its own refcount on the shared buffer, so disposing the backing temp on return keeps the result alive while keeping that last buffer off the finalizer queue. np.random.exponential: β·(-log(1-U)) left uniform, (1-U) and negate() intermediates un-disposed (only the log result was released). Now disposes all of them; only the trailing `* scale` allocates the fresh array returned to the caller. Effect (measured, peak WorkingSet64 growth across a 1000-iter no-GC loop — the CI failure mode): allclose 551 MB -> 3 MB, exponential -> 1 MB. 
Behavior is unchanged: full suite green on net8.0 and net10.0 (9718 passed / 0 failed under the CI filter), including the Logic fuzz corpus and the isclose/allclose/exponential unit tests.
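As a reminder of what np.isclose evaluates, here is a scalar sketch of the tolerance test |a-b| <= atol + rtol*|b| quoted in this commit. It is pure Python and covers only the finite case. The defaults rtol=1e-5 and atol=1e-8 are NumPy's documented defaults. The sketch does not model the intermediate arrays or their disposal, which is what the fix is about.

```python
def isclose(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Asymmetric tolerance test: b is treated as the reference value."""
    return abs(a - b) <= atol + rtol * abs(b)


def allclose(xs, ys, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """True when every pair passes isclose (equal-length sequences assumed)."""
    return all(isclose(x, y, rtol, atol) for x, y in zip(xs, ys))
```

Each term of the formula (the difference, its absolute value, rtol*|b|, the sum, the comparison) is one array-sized temporary in an elementwise implementation, which is where the several float64 and bool intermediates mentioned above come from.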

…ity with NumPy 2.4.2) The complex (complex128) overloads of the unary math ops deferred to System.Numerics.Complex for their finite interior. The BCL transcendentals diverge from NumPy on a wide range of edge inputs — large magnitudes, the unit circle, tiny/subnormal values, branch cuts and signed zeros — because they do NOT implement the careful FreeBSD msun algorithms NumPy uses. This replaces those deferrals with direct ports of NumPy's own routines in NDComplexMath, verified by a 504-point bit-exact sweep (Python struct-packed int64 references) classifying every result as exact / <=3 ULP / signed-zero / special / sign-flip. Result: 18 of 20 complex unary ops are now at full parity (0 divergence beyond <=3 ULP): exp, log, log10, log2, log1p, expm1, exp2, sin, cos, tan, sqrt, square, reciprocal, negative, sinh, cosh, tanh, arcsin, arctan. Algorithms ported / fixes (src/NumSharp.Core/Utilities/NDComplexMath.cs): - Log (npy_clog): real part = log|z| with the four-regime rescale — |z| huge (x2 down), subnormal (x2^53 up), near the unit circle (0.71<=|z|<=1.73 uses 0.5*log1p((m-1)(m+1)+n^2) via a Goldberg MathLog1p), and 0. Complex.Log cancels the real part to 0 near |z|=1 (e.g. log(1+1e-10 i).real must be 5e-21, not 0). ComplexLog is repointed here, so np.log, np.log2 and np.log10 all inherit the accuracy. - Tanh (npy_ctanh): Kahan's algorithm (t=tan(y); beta=1+t^2; s=sinh(x); rho=sqrt(1+s^2); tanh=(beta*rho*s + i t)/(1+beta*s^2)) plus the |x|>=22 overflow-safe branch. The BCL Complex.Tanh drifts ~33 ULP (tan(1.5) through the tan(z)=-i*tanh(iz) identity). - Sin/Cos/Tan: now route ALWAYS through Sinh/Cosh/Tanh, exactly as NumPy defines npy_csin/ccos/ctan (= -i*sinh(iz) / cosh(iz) / -i*tanh(iz)), so they match NumPy bit-for-bit instead of only on the BCL's finite interior. Fixes sin(0+1e300 i).real = NaN (BCL did cosh(huge)*0); the Sinh/Cosh y==0 guard returns (sinh(x), y)/(cosh(x), x*y) so a large real no longer yields inf*0 = NaN. 
- Expm1 (nc_expm1): real = expm1(x)*cos(y) - 2 sin^2(y/2), imag = exp(x)*sin(y); the real expm1 fallback uses the Goldberg identity (e^x-1)*x/log(e^x) which recovers the ~10 digits exp(x)-1 cancels and avoids underflow (expm1(1e-300)=1e-300, not 0). Fixes the non-finite imaginary (expm1(+Inf+0i).imag = exp(+Inf)*sin(0) = NaN) and origin signed zeros. - Square (z*z with FMA contraction): (fma(re,re,-(im*im)), fma(re,im,im*re)). NumPy's complex multiply is FMA-contracted, so square(1e-10+1e-10 i).real = -2.275e-37 (exact re^2 minus rounded im^2) and square(1e300+1e300 i).real = -inf; Complex.op_Multiply (no FMA) returned 0 and NaN. - Atan (npy_catanh, full): atanh(x) on the real axis, atan(y) on the imaginary axis, and the log1p(4|x|/sumsq(|x|-1,|y|))/4 interior, plus _sum_squares and an exponent-classified _real_part_reciprocal (raw biased-exponent field, NOT Math.ILogB which maps 0/Inf to int.MinValue/MaxValue and overflows the subtraction). Complex.Atan cancelled / underflowed the tiny imaginary part (arctan(0+1e-10 i).imag must be 1e-10). - Exp (npy_cexp): exp(-Inf + I(Inf|NaN)).imag = copysign(0, y) so exp(-inf-inf i).imag = -0 (the system libm keeps sign(y); npy_cexp's flat (0,0) dropped it). exp2 inherits this. - Reciprocal already used Smith's nc_recip (overflow-safe, correct signed zeros). Engine wiring (DirectILKernelGenerator[.Unary.Decimal].cs): ComplexLog repointed to NDComplexMath.Log; new cached methods ComplexExpm1 and ComplexSquare; the Expm1 and Square cases in EmitUnaryComplexOperation now call the ported helpers instead of inline Complex.Exp(z)-1 / Complex.op_Multiply. Accepted residuals (pathological inputs only, documented in code + the fuzz registry): - cos/sin with a NaN imaginary part: the resulting zero's sign is C99-UNSPECIFIED; the platform libm and the npy_ccos identity pick opposite signs (2 cases). 
- arccos with a sub-DBL_MIN imaginary part: Complex.Acos flushes the denormal real part to 0 where cacos's _do_hard_work keeps it (~5.8e-309); a denormal-range edge (4 cases). - sinh/cosh at the overflow boundary |x| in [710, 710.13]: Windows' CRT sinh overflows to inf while .NET Math.Sinh stays finite (a platform-libm boundary, absent on glibc). Tests: NewDtypesUnaryTests.cs adds 11 NumPy-2.4.2-verified cases for the huge-imaginary sin/cos, large-real sinh/cosh overflow, Kahan tan accuracy, near-unit-circle clog, scaled log10/log2/log1p, Goldberg expm1, exp2, FMA square, reciprocal signed zeros, catanh tiny/large arctan, and the exp -inf signed zero. Fuzz/MisalignedRegistry.cs tightens the complex-unary gate to <=3 ULP across the whole set (was a 4-ULP gate on 5 ops + a blanket excuse for the rest), narrows the >3-ULP excuse to the named pathological ops, and adds a separate entry for the (pre-existing) complex reduction/scan NaN-ordering divergence the old blanket covered. Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9729 passed, 0 failed. net8.0 + net10.0 both build clean.
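Two of the cancellation fixes listed above can be sketched in a few lines of Python. The helper names are hypothetical, the code covers only the near-unit-circle branch of the logarithm and the Goldberg fallback for expm1, and it follows the formulas quoted in the commit text without claiming to match the ported C# line for line.

```python
import math


def clog_real_near_unit(x: float, y: float) -> float:
    """Real part of log(x + iy) when |z| is close to 1.

    Uses 0.5*log1p((m-1)(m+1) + n^2) with m = max(|x|,|y|), n = min(|x|,|y|),
    which avoids forming |z|^2 = 1 + tiny and losing the tiny part.
    """
    m, n = max(abs(x), abs(y)), min(abs(x), abs(y))
    return 0.5 * math.log1p((m - 1.0) * (m + 1.0) + n * n)


def clog_real_naive(x: float, y: float) -> float:
    """log|z| computed directly; cancels to 0 near the unit circle."""
    return math.log(math.hypot(x, y))


def expm1_goldberg(x: float) -> float:
    """exp(x) - 1 via (e^x - 1) * x / log(e^x), for finite non-overflowing x."""
    u = math.exp(x)
    if u == 1.0:
        return x  # exp(x) rounded to 1, so x itself is the best answer
    if u - 1.0 == -1.0:
        return -1.0  # exp(x) underflowed to 0
    return (u - 1.0) * x / math.log(u)
```

For z = 1 + 1e-10i the naive form returns 0 while the log1p form returns about 5e-21, the value the commit says NumPy produces.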

…-run Adds the benchmark definitions that were missing on one side of the op-matrix join (so the ops showed ⚪ "not run" or were discarded as C#-only), then re-runs the whole official suite (all 14 comparison suites x 3 cache tiers, ~3h) to fill them in with live numbers. Result: ⚪ no-data 130 -> 76, and the headline moves from a stale 0.74x to 1.08x geomean (93%🕐) over 1386 credible cells — the stale figure was dragged down by a broken searchsorted and by simply missing most of NumSharp's fast reductions. NumPy side (numpy_benchmark.py) — C# already benchmarked these; NumPy didn't: * unary: np.tan, np.exp2, np.expm1, np.log2, np.log1p, np.clip(a,-10,10), np.power(a,2|3|0.5) * reduction: np.cumsum (all arithmetic dtypes), np.prod + np.prod axis=0/1, and the axis variants np.amax/np.amin/np.mean axis=0(/1) and np.var/np.std axis=0 All names normalize to the existing C# [Benchmark(Description=...)] so they join 1:1. C# side: * ProdBenchmarks: was non-standard sizes (100/1000/10000) + method-form names (a.prod()); nothing could join it. Switched to the standard Small/Medium/Large tiers and function-form np.prod(a)/np.prod(a, axis=k) — values stay in [0.5,1.0] so the product is overflow-safe at every size. prod now has full + axis coverage (18 cells). * MeanBenchmarks: CommonTypes -> ArithmeticTypes, closing the np.mean uint*/int16 ⚪ holes (15 cells) — matches SumBenchmarks/MinMaxBenchmarks. * LogicBenchmarks: isnan/isinf/isfinite/maximum/minimum/array_equal now join (54 cells). Verified on the fresh run: searchsorted is purged of the 0.0000ms / >1e6x rows (now real, 1.16-1.44x faster), prod/cumsum/all axis reductions/the 6 predicates/mean-on-uint* all matched. Regenerated benchmark-report.{md,json,csv} + dashboard and re-seeded the matrix + dashboard DocFX pages. 
KNOWN BUG surfaced (left as ⚪): np.isclose and np.allclose DETERMINISTICALLY segfault NumSharp with the unmanaged-storage AccessViolation — each crashes even run alone, and in-class it killed the whole logic suite before BenchmarkDotNet could export anything (took the 6 working predicates down with it). Disabled both in LogicBenchmarks with a documented note; re-enable once the NumSharp isclose/allclose lifetime bug is fixed. The 6 predicates were recovered by running each in its own process (the same per-section isolation the NDIter harness uses for its AV).

…segfault) + doc review

Parity review of the complex unary math overloads (commit 416affc). Verified all 20 affected ops across memory layouts (contiguous / F-contiguous / strided / transposed / both negative-stride directions / sliced-offset / broadcast / 0-d / empty) with a fresh bit-exact sweep -- every op is layout-correct with 0 divergence -- and confirmed the out=/where= ufunc parameters compose bit-exactly with the new complex kernels (exp returns the same out instance; sqrt's where=mask preserves masked-off slots).

The review surfaced a pre-existing MEMORY-SAFETY bug (segfault), now fixed:

    np.exp(complex_array, dtype=float64)   # and sqrt/log/log2/log10/log1p/expm1/exp2/sin/cos/tan/
                                           # sinh/cosh/tanh/arcsin/arccos/arctan

segfaulted instead of raising.

Root cause: ResolveUnaryFloatReturnType honored an explicit dtype= override after only rejecting integer/bool targets (over < Single -> "No loop matching"). It never checked that the INPUT can reach the requested loop dtype by a same_kind cast. For a complex input + real-float dtype=, it returned the real type, ExecuteUnaryOp allocated an 8-byte/element output buffer, and the 16-byte/element complex kernel overran it. NumPy 2.4.2 raises instead:

    "Cannot cast ufunc 'exp' input from dtype('complex128') to dtype('float64') with casting rule 'same_kind'"

Fix: ResolveUnaryFloatReturnType now calls the existing ValidateUnaryInputCast (already used by square/reciprocal/negative, which were NOT affected) on the override path. This reuses NDIterCasting.CanCast(SAME_KIND), so it allows the legal narrowings (int->float32, float64->float32, float->complex) unchanged and rejects only the cross-kind complex->real cast, emitting NumPy's verbatim message. Probe matrix (complex/float/int inputs x float/complex/int dtype=) now matches NumPy across all 17 float-producing complex ufuncs; the order is preserved (integer dtype= still raises "No loop matching" before the cast check).

Also refreshes the NDComplexMath class doc comment, which still described the old fork state ("sinh/cosh/tanh/asin/acos/atan delegate straight to System.Numerics.Complex", "arctan's BCL interior is the lone documented divergence") -- it now lists the actual ported algorithms (npy_clog, Kahan ctanh, csinh/ccosh, npy_catanh, npy_cexp/csqrt, nc_expm1/Goldberg, FMA square, nc_recip), the two ops still delegating (asin/acos at parity), and the three accepted pathological residuals.

Tests: NewDtypesUnaryTests.cs adds Complex_FloatUfunc_NarrowingDtype_RaisesCastError_NotSegfault (exp/log/sqrt/sin/tanh/arctan: complex+dtype=float64 raises the verbatim cast error, complex+dtype=int64 raises "No loop matching", complex+dtype=complex128 returns complex). Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9730 passed, 0 failed. net8.0 + net10.0 build clean.

Note: ceil/floor/round/trunc on complex reject cleanly (no segfault) but with NumSharp's own message rather than NumPy's "ufunc not supported for the input types" -- left as-is (out of scope; NumPy has no complex loop for them either). The int->exp2 InvalidProgramException (Single-output kernel) remains a separate, already-tracked bug (fuzz registry W3-C), unrelated to complex.
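The order of checks described in the fix (reject integer targets first, then validate the input cast) can be modelled with a small Python sketch. Everything here is a simplified stand-in: the dtype table, the kind ranking used to approximate same_kind, and the function names are assumptions, and the no-override branch ignores NumPy's width-dependent promotion for small integers.

```python
# name -> (kind, bytes per element); a tiny hypothetical dtype table
DTYPES = {
    "int64": ("i", 8),
    "float32": ("f", 4),
    "float64": ("f", 8),
    "complex128": ("c", 16),
}
# Simplified same_kind model: a cast is allowed if it does not move to a lower kind.
KIND_RANK = {"b": 0, "u": 1, "i": 2, "f": 3, "c": 4}


def can_cast_same_kind(src: str, dst: str) -> bool:
    return KIND_RANK[DTYPES[src][0]] <= KIND_RANK[DTYPES[dst][0]]


def resolve_unary_float_return_type(ufunc: str, in_dtype: str, override=None) -> str:
    """Pick the loop dtype for a float-producing unary ufunc (illustrative)."""
    if override is None:
        # Simplification: every integer input is promoted to float64 here.
        return in_dtype if DTYPES[in_dtype][0] in "fc" else "float64"
    if DTYPES[override][0] not in "fc":
        # Check 1: integer/bool targets have no loop (message abbreviated).
        raise TypeError(f"No loop matching the requested dtype for ufunc '{ufunc}'")
    if not can_cast_same_kind(in_dtype, override):
        # Check 2: the input must reach the loop dtype by a same_kind cast.
        raise TypeError(
            f"Cannot cast ufunc '{ufunc}' input from dtype('{in_dtype}') "
            f"to dtype('{override}') with casting rule 'same_kind'"
        )
    return override
```

Without check 2, a complex128 input with dtype=float64 would get an 8-byte-per-element output while the kernel writes 16 bytes per element, which is the overrun described above.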

…o (match/beat NumPy 2.4.2)

np.zeros was ~1000x slower than NumPy for large arrays (10M float64: 14.3 ms vs NumPy 0.011 ms). Root cause: it allocated an uninitialized buffer and then ran an eager per-element Fill loop that touched (and zeroed) every byte. NumPy instead delegates zeroing to the OS: PyDataMem_NEW_ZEROED -> calloc, whose demand-zero pages are committed and zeroed lazily on first write, so allocating zeros is effectively O(1) regardless of size (numpy/_core/src/multiarray/alloc.c npy_alloc_cache_zero: small sizes use a cache+memset, large sizes calloc).

This ports NumPy's structure. The zeroing is now done by the allocator/OS, never an element loop -- correct for all 15 dtypes because the all-zero bit pattern equals default(T) for every one of them (incl. Half, Single, Double, Decimal, Complex).

Implementation
--------------
- ArraySlice.Allocate(..., fillDefault: true) and Allocate<T>(..., true) now route to UnmanagedMemoryBlock<T>.AllocateZeroed instead of `new UnmanagedMemoryBlock<T>(count, default)` (Take + scalar Fill). All np.zeros overloads, np.zeros_like, np.eye/np.identity, and every internal fill-with-default allocation flow through here.
- SizeBucketedBufferPool.TakeZeroed: NativeMemory.AllocZeroed (calloc) with no dirty-bucket reuse -- a recycled buffer would force a full memset, discarding the lazy demand-zero win for large sizes and being no cheaper than calloc for small ones.
- OsVirtualMemory (new, Windows-only): the Windows process heap eager-commits and memsets mid-size calloc requests (~256 KiB-2 MiB, ~0.05 ms for 800 KiB), unlike glibc/macOS which mmap large blocks lazily. For >= 128 KiB on Windows AllocateZeroed uses VirtualAlloc(MEM_COMMIT) (copy-on-write zero pages, ~0.002 ms) and a new Disposer AllocationType.Virtual that releases straight to the OS via VirtualFree (not pooled). Non-Windows and small sizes stay on calloc, which is already lazy/cheap there.

Benchmark fix (pre-existing bug)
--------------------------------
CreationBenchmarks returned each created array without disposing, leaking one buffer per op. NumPy's harness (numpy_benchmark.py) discards each result inside the timed loop, so CPython refcount frees it immediately -- i.e. NumPy measures alloc+free while the C# benchmark measured alloc-only (unfair) and leaked. Under BenchmarkDotNet's thousands-of-ops-per-iteration, every untouched-but-committed buffer still charges Windows commit, so any fast creation op OOMs at 10M (np.empty(10M) already did; the old np.zeros only escaped by being slow enough to throttle BDN to a couple ops/iteration). All creation benchmarks now dispose per op, matching NumPy and bounding resident memory.

Results (this machine, vs NumPy 2.4.2; BDN alloc+free)
------------------------------------------------------
- 10M float64: 14.3 ms -> 0.0033 ms (was ~1000x slower; now 3.1x faster)
- medium (100K): 1.7-3.8x faster across i32/i64/f32/f64
- large (10M): 1.1-3.5x faster across i32/i64/f32/f64
- small (1K): ~1.5-2x slower -- bounded by NDArray object construction (NDArray/Storage/Shape/ArraySlice/Disposer), shared by all creation APIs and sub-microsecond; the allocation itself is optimal.

Tests
-----
New Creation/np.zeros.AllocationTests.cs (12 tests): all 15 dtypes zeroed across heap/VirtualAlloc size regimes, full-scan of a multi-MB array, VirtualAlloc writeability/commit correctness, OwnsData, non-aliasing, reuse-after-dispose, multi-dim/high-rank/empty/sliced, default dtype, all overloads. Full CI suite (net8.0 + net10.0, excluding OpenBugs/HighMemory) green: 0 failed, 9742 passed.
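The idea of letting the OS supply zeroed memory, and of routing by size and platform, can be illustrated in Python with an anonymous memory mapping. This is an analogy, not the NumSharp code path: mmap stands in for calloc/VirtualAlloc, the function names are hypothetical, and the 128 KiB threshold is the one stated in the commit.

```python
import mmap

WINDOWS_VIRTUAL_THRESHOLD = 128 * 1024  # threshold quoted in the commit text


def choose_zeroed_allocator(nbytes: int, is_windows: bool) -> str:
    """Name the allocator the commit describes for a given request size."""
    if is_windows and nbytes >= WINDOWS_VIRTUAL_THRESHOLD:
        return "VirtualAlloc"
    return "calloc"


def zeroed_buffer(nbytes: int) -> mmap.mmap:
    """Return nbytes (> 0) of zeroed memory from an anonymous mapping.

    No loop writes the zeros: the OS hands out zero pages and typically
    backs them with physical memory only when they are first touched.
    """
    return mmap.mmap(-1, nbytes)
```

The caller should close the mapping when finished, which mirrors the per-op dispose the benchmark fix introduces.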

…o longer crashes)

nd[(nd < 3)] = -2 -- assigning a scalar into a boolean-mask-selected subspace -- used to trip a Debug.Assert and kill the test host (the test pre-threw to dodge it). The broadcast-value assignment path (SetIndicesND scalar/broadcast handling) fixed it; the whole flow now matches NumPy:

    nd = [[1,2,3],[4,5,6]]; nd[nd < 3] = -2   -> [-2,-2,3,4,5,6]
    nd[(nd == -2) | (nd > 5)]                 -> [-2,-2,6]

Removed the pre-throw guard and the [OpenBugs] attribute.

Tests: NDArray.Indexing.Test class 123 passed / 0 failed (net8.0).

…he int[] overload

The differential index oracle (NumPy vs NumSharp, 2265 getter/setter cases x layouts + 104 dtype cases) surfaced 12 setter divergences, all via the object[] single-int setter b[(object)0] = v and the long[] coordinate shim b.SetData(v, 0L):

    b = arange(12).reshape(3,4)
    b[(object)0] = -1      -> [-1,1,2,3,...]    NumPy: [-1,-1,-1,-1,...] (fills row 0)
    b[(object)0] = [1,2]   -> partial write     NumPy: ValueError ((2,) into (4,))

Root cause: SetData(NDArray, params long[]) carried the OLD logic the int[] overload had before the broadcast fix -- for a scalar it wrote only the FIRST element of the sub-array (not a fill), and for a larger/smaller value it linear-copied with no broadcast validation. (b[0] = v via a literal int goes through the Slice path and was always correct; only the object[]/long[] entry points were wrong.)

Fixed by delegating SetData(NDArray, long[]) to the corrected SetData(NDArray, int[]) so scalar-broadcast-across-subarray, value broadcasting/tiling, and the NumPy shape-mismatch ValueError all apply uniformly.

After the fix the differential sweep is 2265/2265 + 104/104 = 0 divergences (1-to-1 parity: success/failure agreement, exact shape, bit-exact gathered values). Added regression ObjectArraySingleInt_Setter_BroadcastsAndValidates.

Tests: full CI suite 10977 passed / 0 failed / 11 skipped (net8.0).
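The behavior the setter now follows (fill the whole sub-array for a scalar, reject a value that cannot broadcast) can be sketched on nested Python lists. The helper names are hypothetical and the shape rule is simplified to the 1-D row case in the example above.

```python
def check_broadcastable(value_shape: tuple, target_shape: tuple) -> None:
    """Raise ValueError unless value_shape broadcasts to target_shape."""
    ok = len(value_shape) <= len(target_shape) and all(
        v in (1, t) for v, t in zip(reversed(value_shape), reversed(target_shape))
    )
    if not ok:
        raise ValueError(
            f"could not broadcast input array from shape {value_shape} "
            f"into shape {target_shape}"
        )


def set_row(matrix: list, row: int, value) -> None:
    """Assign a scalar or 1-D sequence into one row, with broadcasting."""
    width = len(matrix[row])
    if isinstance(value, (int, float)):
        values, shape = [value], ()
    else:
        values = list(value)
        shape = (len(values),)
    check_broadcastable(shape, (width,))
    # After validation len(values) is 1 or width, so modulo indexing is safe.
    matrix[row] = [values[j % len(values)] for j in range(width)]
```

With b built as a 3x4 list of 0..11, set_row(b, 0, -1) fills all of row 0, while set_row(b, 0, [1, 2]) raises instead of writing a partial row.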

…y-safety) The differential index oracle (random-fuzz layer over exotic mixed advanced-index combinations) found that over-indexing a rank-N array with more than N advanced index arrays walked strides past the end of the shape: the subshape was sized `srcShape.NDim - ndsCount` (negative) and the offset/getter loops dereferenced strides[i] beyond the array -> OOB read/write (heap corruption / OverflowException). Added a guard in FetchIndices<T> and SetIndices<T>: ndsCount > source.ndim now raises IndexError "too many indices for array: array is N-dimensional, but M were indexed" (NumPy parity) before any unsafe stride math runs. Tests: full CI suite 10978 passed / 0 failed / 11 skipped (net8.0). Note: the differential sweep still surfaces deeper mixed advanced-index divergences (multi-dim fancy + slice + 0-d bool + newaxis/empty combinations) and a separate flaky OOB in that path — tracked separately; the curated common-surface sweep (2369 cases x layouts x dtypes) remains 0 divergences.

A differential index oracle (NumPy 2.4.2 vs NumSharp, curated 2369 cases + a seeded random-fuzz layer) proved the curated common surface is bit-exact (0 divergences) but ~660-700 divergences remain across EXOTIC mixed advanced-index combinations (bool-array+fancy, multi-dim fancy+slice, 0-d-bool+fancy, multi-fancy, empty combos) plus a flaky heap-corruption crash in that path. Handover documents how to close it by porting NumPy mapping.c's unified two-stage model (prepare_index + MapIterNew/_get_transpose) to REPLACE the current per-shape Try* fast-path patchwork, which cannot generalise. Covers: - precise divergence categories (by failure mode and by index-form feature) - why the patchwork architecture cannot reach parity - the NumPy algorithm with file:line citations (mapping.c) - a phased plan: lock the gate -> hunt the OOB crash -> PrepareIndex -> unified MapIter gather/scatter -> edges/overlap - keep-vs-replace map of every existing indexing helper - the differential harness (token encoding, base recipes, run/regenerate) and how to promote it into the committed test/oracle + [FuzzMatrix] gate - memory-safety crash hunt (page-heap/GCStress to catch the delayed OOB at the write) - DOD, risks, first-day checklist Successor to advanced-index-axis-placement.md (which resolved the two-advanced+slice sub-case via TryBuildMultiAdvancedGrid).

… (Phase A)

Promotes the scratchpad getter/setter differential harness into the committed oracle pipeline, per docs/plans/advanced-index-combinatorial-handover.md Phase A. This is the gate the full mapping.c port (Phases C-E) must drive to 0/0; until it is committed it cannot defend the fix.

What lands:
- test/oracle/gen_index_oracle.py -- NumPy 2.4.2 oracle. Emits a portable TOKEN corpus (index fields encoded as [int n]/[slice ...]/[new]/[ell]/[arr flat shape]/[barr ...]/[b0 bool]/[a0 n]; values [scalar n]|[arr ...]) across 15 base recipes (S,V0,V1,V6,A,AT,ARS,ACS,ANR,ANC,ASO,ABC,B,BT,E03) for get+set, a 13-dtype sweep, and a seeded random-fuzz layer. Writes JSONL into Fuzz/corpus/ (csproj glob copies it to test output -- no Python at test time, matching the existing FuzzMatrix gates).
- Fuzz/IndexOracleTests.cs -- [FuzzMatrix] replay. Rebuilds the SAME base+index from tokens, runs get/set, bit-compares shape + int64 values + which-side-raised.
- Three corpora:
    index_curated.jsonl (2265)           -- deterministic matrix, CI gate
    index_dtype.jsonl (104)              -- forms x 13 dtypes, CI gate
    index_random_20240626.jsonl (10000)  -- seeded fuzz, the target

Gate status (reproduced, both frameworks):
- Index_Curated + Index_Dtype: 0 divergences (green, run in CI as FuzzMatrix).
- Index_Random: ~697 divergences (209 throws-on-valid, 404 accepts-invalid, 84 shape/value) + a flaky heap-corruption AccessViolation in the mixed-advanced path. Marked [OpenBugs] so CI excludes it (avoids the crash) until Phases C-E land; un-marked at Phase E per the handover DOD.

The curated/dtype gate pins the 13 indexing fixes already on nditer (b03e40b7..998c1d23) so they cannot regress while the combinatorial port proceeds.
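The replay idea (rebuild the index from tokens, run it, compare values and which side raised) can be sketched in Python over a plain list. The JSON schema below is invented for illustration; the commit only partially spells out the real token grammar, and the committed replay is C#, not Python.

```python
import json


def decode_index(token: dict):
    """Turn one hypothetical JSON token into a Python index object."""
    kind = token["kind"]
    if kind == "int":
        return token["n"]
    if kind == "slice":
        return slice(token.get("start"), token.get("stop"), token.get("step"))
    raise ValueError(f"unsupported token kind: {kind}")


def run_get(base: list, index) -> dict:
    """Run one getter and record either the values or the fact that it raised."""
    try:
        result = base[index]
    except IndexError:
        return {"raised": True}
    return {"raised": False, "values": result if isinstance(result, list) else [result]}


def replay(corpus_lines, base: list) -> list:
    """Return (line number, expected, actual) for every diverging case."""
    divergences = []
    for line_no, line in enumerate(corpus_lines, 1):
        case = json.loads(line)
        actual = run_get(base, decode_index(case["index"]))
        if actual != case["expected"]:
            divergences.append((line_no, case["expected"], actual))
    return divergences
```

An empty result from replay corresponds to the "0 divergences" state the curated and dtype gates are held to.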

…pt-in page-heap (Phase B)

Memory-safety hardening for the advanced-index path, per docs/plans/advanced-index-combinatorial-handover.md Phase B.

Block-copy bounds guards (permanent fix)
----------------------------------------
The fancy gather/scatter copy one subShapeSize block per selected offset but the upstream bound check only validated each block's START offset, not its full extent. A miscomputed retShape/subShape (the exotic mixed-advanced combos the per-shape Try* dispatch mishandles) therefore copied past the end of a pinned/native buffer -> silent heap corruption (the flaky AccessViolation the differential sweep surfaced). Each raw block-copy / odometer / value-read site now validates the WHOLE span against the real buffer capacity (Shape.BufferSize) and throws a tagged IndexOutOfRangeException instead of corrupting:
- FetchIndicesND (getter contiguous block gather)
- FetchIndicesNDNonLinear (getter strided odometer gather)
- SetIndicesND (+ non-linear) (setter block scatter)
- SetIndices non-subshaped value read (value shorter than the selection)

Shared IndexingOobMessage() names the offending copy + computed retShape/subShape so a divergence is traced to the mishandled index combination.

Opt-in page-heap (diagnostic infra, zero production impact)
-----------------------------------------------------------
SizeBucketedBufferPool gains a NUMSHARP_GUARD_PAGES=1 mode (Windows only, read once at startup): every pool Take hands back a buffer whose last byte abuts an inaccessible PAGE_NOACCESS guard page (OsVirtualMemory.AllocGuarded/FreeGuarded), bypassing reuse, so a one-past-the-end write into a POOL buffer faults instantly at the offending access. Default OFF -- Take/TakeZeroed/Return and the np.zeros VirtualAlloc bypass keep their exact production paths.

Findings (recorded for Phase C)
-------------------------------
Every corruptor case the sweep names is an index combination NumPy REJECTS that NumSharp's Try* stack wrongly accepts and feeds malformed shapes to a kernel:

    V6[arr([3,1]), 2, arr([])]                      -> over-indexed 1-D (NumPy IndexError)
    A[barr([F],(1,)), None, barr([F,F,F,F],(4,))]   -> bool length != axis (NumPy IndexError)

The guard pages did not fault on these in isolation because the overrun target is a FromArray-pinned MANAGED index array (not a pool buffer), confirming the real fix is up-front validation: Phase C's prepare_index rejects these before any kernel runs, which structurally eliminates the whole OOB class. The block-copy guards above remain as a defense-in-depth backstop.

No regression: 1299 indexing/selection tests + the Index_Curated/Index_Dtype gate green.
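The difference between checking only a block's start offset and checking its whole span is easy to show with Python bytearrays. The function below is a sketch with hypothetical names. In Python an out-of-range slice is silently truncated rather than corrupting memory, so the guard here only models the check, not the failure it prevents in native code.

```python
def copy_block_checked(dst: bytearray, dst_offset: int,
                       src: bytes, src_offset: int, block_size: int) -> None:
    """Copy one block after validating the full span on both buffers."""
    if block_size < 0 or src_offset < 0 or dst_offset < 0:
        raise IndexError("negative offset or block size")
    if src_offset + block_size > len(src):
        raise IndexError(
            f"source span [{src_offset}, {src_offset + block_size}) "
            f"exceeds capacity {len(src)}"
        )
    if dst_offset + block_size > len(dst):
        raise IndexError(
            f"destination span [{dst_offset}, {dst_offset + block_size}) "
            f"exceeds capacity {len(dst)}"
        )
    dst[dst_offset:dst_offset + block_size] = src[src_offset:src_offset + block_size]


def start_only_check(capacity: int, offset: int) -> bool:
    """The weaker check described above: only the block's first element."""
    return 0 <= offset < capacity
```

A block of 4 starting at offset 6 in an 8-element buffer passes start_only_check but is rejected by copy_block_checked, which is the gap the guards close.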

…ate (Phase C)

Implements docs/plans/advanced-index-combinatorial-handover.md Phase C: a faithful port of NumPy 2.4.2's prepare_index (numpy/_core/src/multiarray/mapping.c:262 prepare_index_noarray) that classifies and VALIDATES the whole index tuple in one pass before any per-shape Try* fast path runs. This replaces the scattered, per-shape validation that let the heuristic stack accept combinations NumPy rejects and feed malformed shapes to a kernel.

New file Selection/NDArray.Indexing.PrepareIndex.cs:
- IndexType (NumPy HAS_* bitmask), IndexKind, IndexOp, PreparedIndex.
- PrepareIndex(Shape, object[]): the classification cascade (ellipsis / newaxis / slice / integer / 0-d-bool / k-d-bool->nonzero / integer-array / 0-d-array-scalar / invalid), the ellipsis fill + HAS_SCALAR_ARRAY cleanup + a[()] special case, then the post-walk validation NumPy does once axis placement is known:
  * too-many-indices -> 'array is N-dimensional, but M were indexed' (mapping.c:665)
  * boolean array dim -> 'boolean index did not match indexed array along axis A...' (:709)
  * integer/array VALUE bounds -> 'index N is out of bounds for axis A with size S'
  * advanced block broadcast -> 'shape mismatch: indexing arrays could not be broadcast together with shapes ...' (:2617) [bit-exact message, NumPy-verified]
  * single ellipsis / non-integer-or-boolean array -> the verbatim IndexErrors.
- Wired as a gate at the top of FetchIndices/SetIndices(object[]) for every multi-index tuple (indicesLen != 1); valid tuples pass straight through to the existing dispatch.

Impact (differential random sweep, seed 20240626):
- ns-accepted-invalid 404 -> ~7 for the index-structure classes (over-index, bool-length mismatch, OOB index value, un-broadcastable advanced). The combinations that also drove the mixed-advanced heap corruption now raise BEFORE any kernel, removing that OOB source.
- Total divergences ~697 -> ~440 (windowed; the residual ~89 accepts-invalid are SETTER value-broadcast cases -> Phase E, and the shape/value + rejects-valid buckets -> Phase D).
- A residual flaky wrong-shape overrun into a pinned managed index array survives in the random sweep only (the [OpenBugs] gate, CI-excluded); it is one of the wrong-shape divergences Phase D's exact axis placement eliminates (handover: correct shapes => no OOB).

Tests: full net8.0 suite 10980 passed / 0 failed; net10.0 indexing + Index_Curated/Index_Dtype gate green. IndexNDArray_Case10_Multi now expects IndexError (NumPy-correct; was the non-NumPy IncorrectShapeException) and GetIndicesFromSlice's reflection proxy matches by name (PrepareIndex also takes Shape as its first parameter).

Slice.ToSliceDef's negative-step branch clamped a start more negative than -dim to 0, yielding a spurious length-1 slice that began at index 0; NumPy clamps it to -1 ('before the beginning' when walking backwards), making the slice empty.

  arange(3)[-7::-2]    NumPy []  was NumSharp [0]
  arange(3)[-4::-2]    NumPy []  was NumSharp [0]
  arange(2)[-7:-3:-2]  NumPy []  was NumSharp [0]

In-range negative starts (e.g. [-2::-1] == [1,0], [-1::-1] == [2,1,0]) and the positive-step branch are unchanged; only an out-of-lower-bound negative start with a negative step is affected.

Surfaced by the differential index sweep (13 pure-basic-slice divergences in the first 5000 random cases, now 0).

Tests: 2028 indexing/slice/view + Index_Curated/Index_Dtype gate green; no regression.
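The clamping rule described above is the same one Python's built-in `slice.indices` applies, so it can be sketched without NumPy. This is a minimal illustration, not NumSharp's `Slice.ToSliceDef`; the names `normalize_slice` and `slice_length` are hypothetical.

```python
def normalize_slice(start, stop, step, dim):
    """Clamp (start, stop, step) against an axis of length `dim`.

    Hypothetical helper that mirrors Python's slice.indices(): with a
    negative step, an index that is still negative after adding `dim`
    is clamped to -1 ("before the beginning"), never to 0.
    """
    if step is None:
        step = 1
    if step == 0:
        raise ValueError("slice step cannot be zero")
    # The clamp bounds depend on the walking direction.
    lower, upper = (0, dim) if step > 0 else (-1, dim - 1)

    def clamp(value, default):
        if value is None:
            return default
        if value < 0:
            # Wrap once, then clamp to the lower bound for this direction.
            return max(value + dim, lower)
        return min(value, upper)

    start = clamp(start, lower if step > 0 else upper)
    stop = clamp(stop, upper if step > 0 else lower)
    return start, stop, step


def slice_length(start, stop, step, dim):
    """Number of elements the normalized slice selects."""
    return len(range(*normalize_slice(start, stop, step, dim)))


# The three cases from the commit message are all empty:
print(slice_length(-7, None, -2, 3))  # arange(3)[-7::-2]
print(slice_length(-4, None, -2, 3))  # arange(3)[-4::-2]
print(slice_length(-7, -3, -2, 2))    # arange(2)[-7:-3:-2]
```

Clamping to -1 rather than 0 is what makes `range(start, stop, step)` empty here: with a negative step, both `start` and the default `stop` land on -1.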

…eg-stride offset bound (Phase D)

Two coupled fixes that make TryBuildMultiAdvancedGrid the single advanced-index gather for ALL HAS_FANCY tuples (mapping.c MapIterNew axis placement), replacing the np.take fast path:

1. largestReachableOffset (neg-stride bound). FetchIndices/SetIndices validated gather offsets against GetOffset(size-1 corner), which for a NEGATIVE-stride view is the MINIMUM corner, not the maximum, so valid early-row offsets on a[::-1]/a[:,::-1] were rejected as out of bounds (IndexOutOfRangeException). Now bounded by the true max reachable offset (base + per-axis positive-stride contribution; == size-1 when contiguous, unchanged for positive strides).

2. Grid handles a SINGLE advanced index. TryBuildMultiAdvancedGrid required >=2 advanced axes, so a single MULTI-DIM fancy array mixed with a slice (a[arr(2,2), 1::2]) fell through to the non-general broadcast path and dropped the slice's output axis (NumPy (2,2,1) -> NumSharp (2,2)). Lowered to >=1; with the offset fix the grid now also subsumes the 1-D fancy + slice/newaxis cases the np.take path mishandled (newaxis axis arithmetic, negative slices, non-contiguous sources all threw ArgumentOutOfRange/IndexOutOfRange on valid input).

The getter no longer calls TryFetchSliceWithSingleAdvanced (now unreferenced; the whole Try* stack is removed in the final Phase D cleanup once the setter is migrated too).

Impact (random sweep [0,5000)): divergences 123 -> 74; shapeDiff 41 -> 17, the np.take ArgumentOutOfRange bucket (20) eliminated. The remaining threw-on-valid are empty advanced indices (arr([])/barr([])) and 0-d-bool combos (Phase E).

Tests: net8.0 1681 indexing/selection/slice + net10.0 1216 indexing, Index_Curated/Index_Dtype gate green on both.
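The negative-stride bound in fix 1 can be illustrated with plain offset arithmetic. This is a sketch under assumptions: offsets and strides are counted in elements, and `reachable_offset_bounds` and `last_corner_offset` are hypothetical names, not NumSharp APIs.

```python
def last_corner_offset(base, shape, strides):
    """Offset of the element at index (dim-1, dim-1, ...).

    This is the quantity the old check used as the upper bound.
    """
    return base + sum((dim - 1) * stride for dim, stride in zip(shape, strides))


def reachable_offset_bounds(base, shape, strides):
    """(min, max) element offsets a strided view can touch.

    Positive strides push the maximum up, negative strides push the
    minimum down. Returns None for an empty view, which touches nothing.
    """
    if any(dim == 0 for dim in shape):
        return None
    lo = hi = base
    for dim, stride in zip(shape, strides):
        span = (dim - 1) * stride
        if span > 0:
            hi += span
        else:
            lo += span
    return lo, hi


# a = arange(6); a[::-1] starts at offset 5 and walks with stride -1.
print(last_corner_offset(5, (6,), (-1,)))       # the MINIMUM corner
print(reachable_offset_bounds(5, (6,), (-1,)))  # true (min, max)

# a = arange(12).reshape(3, 4); a[:, ::-1] starts at offset 3.
print(last_corner_offset(3, (3, 4), (4, -1)))
print(reachable_offset_bounds(3, (3, 4), (4, -1)))
```

For the reversed 1-D view the last corner sits at offset 0 while the view reaches up to offset 5, so bounding by the last corner rejects every valid offset above 0. For a contiguous positive-stride view both functions agree on size-1.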

…cy (Phase E)

NumPy force-casts any size-0 index array to intp and treats it as an empty integer fancy index, never a boolean mask (mapping.c:425): A[np.array([], bool)] -> (0,4), not a 'boolean index did not match' length error. NumSharp routed an empty bool array through the boolean-mask path, which enforced length==axis and threw.

Fixed in all three classification sites:
- PrepareIndex (multi-index tuples): a size-0, ndim>=1 array becomes an empty FancyArr (bool cast to int64), consuming one axis with a 0-size block dim.
- The single-index getter and setter NDArray paths (which bypass PrepareIndex): an empty array routes to the empty-fancy gather/scatter instead of BooleanMask.

Empty INTEGER fancy was already correct (A[empty_int] -> (0,4), B[:,empty_int] -> (2,0,4)); this extends the same to empty bool.

Random sweep [0,5000): divergences 74 -> 59, threw-on-valid 50 -> 35. Curated/Dtype gate + 1216 net8.0 indexing tests green.

… ValueError (Phase E)

UnmanagedStorage.SetData(NDArray,int[]) treated ANY size-0 value as an unconditional no-op (added for np.pad's empty-axis assignment). But NumPy only no-ops when the TARGET region is also empty; assigning an empty array into a NON-empty region cannot broadcast and raises ValueError:

  A[()] = np.array([])   NumPy ValueError   was NumSharp silent no-op
  A[:]  = np.array([])   NumPy ValueError   was NumSharp silent no-op

Now guarded by the target subShape size: empty-into-empty still no-ops (np.pad preserved), empty-into-non-empty raises the NumPy 'could not broadcast input array from shape ... into shape ...' ValueError.

Random setter region [7500,10000): divergences 185 -> 122, ns-accepted-invalid 89 -> 37 (and zero new threw-on-valid: the guard fires only where NumPy raises). Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.

…_BOOL + fancy) (Phase E)

A 0-d boolean index (np.array(True)/np.array(False), NumPy HAS_0D_BOOL) mixed with a fancy array used to crash the grid (np.nonzero on a 0-d bool is unsupported) or fall to the broadcast path. NumPy treats it as a length-1 (True) / length-0 (False) array that joins the advanced BLOCK broadcast but consumes NO source axis and adds no output dim of its own.

TryBuildMultiAdvancedGrid now models it (new MixKind.ZeroBool):
- classified before the k-d-mask case; contributes its (1,)/(0,) array to the block broadcast, consumes no source axis (axisOfItem = -1), counts toward block consecutiveness;
- the grid fires when a slice/newaxis OR a 0-d bool is present (a 0-d bool can't use the broadcast path), with >=1 real fancy axis;
- advBOf[] maps each block member to its broadcast slot so only real fancy axes get a per-axis index array, while the 0-d bool only shapes the block.

Probed vs NumPy 2.4.2 (all exact):

  A[arr(2,1), True]          -> (2,1,4)
  A[True, arr([2,2])]        -> (2,4)
  V6[True, arr([1,2])]       -> (2,)
  A[True, arr([0,1]),True]   -> (2,4)
  A[arr([0,1]), True, 1]     -> (2,)
  A[arr([0,1]), False]       -> IndexError (block (2,) vs (0,))

(The False-mismatch IndexError is already raised up front by PrepareIndex's broadcast-together check.)

Random GET sweep: 0-d-bool divergences 56 -> 13. Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.
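The block-broadcast rule above (a 0-d bool joins the block as a (1,) or (0,) member) can be sketched on shapes alone. This is an illustration, not the grid implementation: `broadcast_shapes` and `block_shape` are hypothetical helpers, fancy arrays are represented only by their shape tuples, and a Python `bool` stands in for a 0-d boolean index.

```python
def broadcast_shapes(*shapes):
    """Right-aligned broadcast of several shapes; raises IndexError on mismatch."""
    ndim = max((len(s) for s in shapes), default=0)
    out = [1] * ndim
    for shape in shapes:
        for i, dim in enumerate(reversed(shape)):
            pos = ndim - 1 - i
            if dim == 1:
                continue                 # a size-1 axis stretches to anything
            if out[pos] == 1:
                out[pos] = dim           # includes stretching 1 -> 0
            elif out[pos] != dim:
                raise IndexError(
                    "shape mismatch: indexing arrays could not be broadcast together"
                )
    return tuple(out)


def block_shape(members):
    """Shape of the advanced block.

    `members` holds a shape tuple per fancy array and a bool per 0-d
    boolean index: True contributes (1,), False contributes (0,).
    """
    shapes = [((1,) if m else (0,)) if isinstance(m, bool) else tuple(m)
              for m in members]
    return broadcast_shapes(*shapes)


print(block_shape([(2, 1), True]))     # block for A[arr(2,1), True]
print(block_shape([True, (2,)]))       # block for A[True, arr([2,2])]
print(block_shape([True, (2,), True])) # block for A[True, arr([0,1]), True]
try:
    block_shape([(2,), False])         # A[arr([0,1]), False]
except IndexError as exc:
    print("IndexError:", exc)
```

On a (3,4) array the first block, (2,1), replaces the one consumed axis and keeps the trailing 4, giving the probed (2,1,4). The 0-d bool changes only the block shape and consumes no source axis.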

…l through to fancy) (Phase E)

The setter's pure-basic branch, new NDArray(Storage.GetView(slices)).SetData(values, []), had no `return` after it (the getter's equivalent returns its view). So after correctly assigning through the slice view, control FELL THROUGH into the _NDArrayFound advanced-index label, which re-interpreted the same slices as fancy index arrays (GetIndicesFromSlice per axis) and tried to broadcast/scatter them. That re-interpretation threw on every pure-slice / ellipsis / newaxis assignment:

  A[...] = scalar     -> ArgumentException 'Value cannot be an empty collection (indices)' (ellipsis builds an EMPTY advanced index list)
  A[:, :] = scalar    -> IncorrectShapeException 'objects cannot be broadcast to a single shape' (slice index arrays (3,) and (4,) can't broadcast together)
  A[None] = arr([9])  -> ArgumentException (newaxis -> empty advanced list)

Adding the missing `return` makes the slice view assignment terminal, matching the getter.

Random setter region [7000,10000): divergences 120 -> 62, threw-on-valid 68 -> 10 (the entire ArgumentException(34)+IncorrectShapeException(14) buckets cleared). Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.

…g index (Phase E)

ExpandEllipsisForMixed (used by the 0-d-bool and leading-mask basic handlers) counted a 0-d boolean toward the axis-consuming items, so the ellipsis under-filled by one and the inserted size-1/0 axis landed at the wrong output position:

  AT[..., True]     on (4,3)   NumPy (4,3,1)   was NumSharp (4,1,3)
  ABC[..., False]   on (3,4)   NumPy (3,4,0)   was NumSharp (3,0,4)
  A[..., False,0]   on (3,4)   NumPy (3,0)     was NumSharp (0,4)

A 0-d bool (HAS_0D_BOOL) consumes NO source axis, so it is now skipped in the ellipsis fill count alongside newaxis.

Curated/Dtype gate + 1216 net8.0 indexing tests green.

…sts to selection (Phase E)

The non-subshaped fancy setter branch (a[fancy] = value where the fancy indices cover every axis) wrote the value flat into the selected slots without checking it broadcasts to the indexing-result shape. A value of an incompatible shape was silently partial-written (or caught late by the memory-safety guard) instead of raising NumPy ValueError:

  V6[[2]] = [1,2,3,4,5]   NumPy ValueError (value (5,) into selection (1,))   was silent/IOoR

Now it materializes the value to a C-contiguous buffer of exactly retShape via np.broadcast_to (matching the subshaped branch), raising the NumPy 'could not broadcast input array from shape ... into shape ...' ValueError on mismatch; scalar and exactly-matching values keep their fast paths.

Random setter region [7500,10000): divergences 52 -> 38, ns-accepted-invalid 39 -> 21 (the residual are empty-selection assignments, which short-circuit before this point). Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.

…memory safety) (Phase B/E)

UnmanagedStorage.SetData(NDArray, int[]) did NOT wrap negative coordinates, unlike the getter's GetData(int[]) which calls Shape.InferNegativeCoordinates. So a negative single-index assignment reached via the object[] setter or the long[] coordinate shim wrote at buffer[-1]:

  b[(object)-1] = v   wrote ONE ELEMENT BEFORE the buffer (OOB heap write), leaving the array unchanged (NumPy assigns the LAST element)

This was the mixed-advanced sweep's flaky AccessViolation: a fresh np.arange(6) copy whose buffer[-1] write corrupted adjacent native-pool/GC memory, fatal only after enough accumulation (found by amplifying each divergent case in a tight loop until set/V6/rand/7047 = V6[-1]=scalar crashed).

SetData now applies the same InferNegativeCoordinates wrap+bounds-check as GetData: the last element is assigned, and a genuinely out-of-range index raises NumPy's IndexError.

Random sweep: divergences 93 -> 65; the divergent-case mini-corpus that crashed at this case now survives 4000x. Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.
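The wrap-then-bounds-check step the setter was missing is small enough to show in full. This is a sketch of the rule only; `wrap_coordinate` is a hypothetical name and not the signature of Shape.InferNegativeCoordinates.

```python
def wrap_coordinate(index, dim, axis=0):
    """Map a possibly negative coordinate onto [0, dim) or raise IndexError.

    Skipping this step and using `index` directly as a buffer offset is
    what turned b[-1] = v into a write one element before the buffer.
    """
    wrapped = index + dim if index < 0 else index
    if not 0 <= wrapped < dim:
        raise IndexError(
            f"index {index} is out of bounds for axis {axis} with size {dim}"
        )
    return wrapped


buffer = list(range(6))                  # stand-in for np.arange(6)
buffer[wrap_coordinate(-1, len(buffer))] = 99
print(buffer)                            # the LAST element is assigned

try:
    wrap_coordinate(-7, 6)               # genuinely out of range
except IndexError as exc:
    print("IndexError:", exc)
```

Applying the same helper on both the read and write paths keeps the getter and setter from disagreeing about what a negative coordinate means.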

…nces, Phases A-E)

Updates the combinatorial advanced-indexing handover with an Execution status section: Phases A-C done, D done, E mostly done; differential random sweep 697 -> 64 divergences (91%), curated/dtype gate 0/0, full CI 10980/0 on net8.0 + net10.0.

Lists the landed commits per phase and the precise remaining work (setter value-broadcast on empty selections, multi-0-d-bool placement in TryBuild0dBoolWithBasic, multi-dim-fancy+ellipsis+int, a second layout-dependent teardown OOB, and the final Try* cleanup + Index_Random un-mark).

…-advanced (Phase D)

TryBuildMultiAdvancedGrid bailed when no slice/newaxis/0-d-bool was present (hasExplicitBasic), sending pure-advanced tuples (fancy+int, fancy+fancy with no slice) to the old _NDArrayFound broadcast path. That path mis-placed a MULTI-DIM fancy combined with an int:

  AT[..., arr(4,1), -1]   NumPy (4,1)   was NumSharp (4,3)

The grid is NumPy's general advanced-index algorithm (block broadcast + consec-aware placement), so it now fires for any tuple carrying an advanced block member (a fancy array, or a 0-d bool); only pure-basic tuples (slices/ints/newaxis, no block) fall through to the view path.

Random sweep 64 -> 61; full indexing/selection suite 1299 net8.0 green.

…tinuation plan

Amends the Execution status with a detailed, ordered Remaining work section (R1-R4) plus a Diagnostic tooling subsection, so the open items can be picked up directly:

- R1 Setter value-broadcast on EMPTY selections (~25, largest): root cause is the SetIndices<T> empty short-circuits returning before retShape + value validation; fix restructures so retShape is computed first, value broadcast-validated (ValueError), then empty -> no-op; incl. the 0-d-bool-False branch. File:method:line, care/gate, expected delta.
- R2 Multi / non-consecutive 0-d-bool placement: TryBuild0dBoolWithBasic lacks the consec rule; route to the grid (which has ZeroBool + consec) or delete it; unit-test the permutation.
- R3 Second layout-dependent teardown OOB: writes past a PINNED MANAGED array (page-heap can't catch); re-check after R1/R2, else red-zone FromArray or loop_mini cross-case repetition.
- R4 Final cleanup: delete the dead TryFetchSliceWithSingleAdvanced + getter/setter _NDArrayFound (the grid owns all HAS_FANCY now); un-mark Index_Random [OpenBugs] at 0/0.

Also documents the scratchpad diagnostic harnesses (replay_index_jsonl / gchunt / loop_each / loop_mini, page-heap, mini-corpus build, the runfile-cache gotcha) and updates the totals to 697 -> 61 / commit range through 9c2e16b2.

…gnment (R1)

NumPy requires an assigned value to broadcast to the indexing-RESULT shape even when that selection is empty (contains a 0) or is a single element. A value that cannot broadcast raises ValueError; it is NOT silently a no-op. NumSharp short-circuited three setter paths before validating, accepting what NumPy rejects.

Random differential sweep (seed 20240626): the set-side "ns-accepted-invalid" bucket drops 25 -> 0 (61 -> ~36 total divergences). Curated/dtype gate stays 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.

Three sub-fixes, all the same NumPy rule (value broadcasts to selection-or-raise; a scalar always broadcasts):

1. 0-d-bool-False mixed with basic (Selection.Setter.cs). The any-False branch of TryBuild0dBoolWithBasic returned with no validation. Now it computes the empty selection shape (this[boolBasic].shape with [boolAxis]=0) and broadcasts the value there, raising the NumPy "shape mismatch: value array of shape X could not be broadcast to indexing result of shape Y" (no-space tuple) on mismatch. Covers e.g. set ANR[slice, b0(False)] = arr([4,75]) and the multi-0-d-bool forms [b0F,b0T,int] / [b0T,b0F,int] (one False -> length-0 block).

2. Scalar-element target (UnmanagedStorage.Setters.cs, scalar-to-scalar branch). When the coordinate consumes EVERY axis the target is a single element (shape ()), where NumPy requires a 0-d / scalar value: a 1+-D array, even one of size 1 such as a[3] = np.array([78]) or m[0,2] = np.array([94]), raises "setting an array element with a sequence."; it is NOT unwrapped to its first element. The looser valueIsScalary (which also accepts a (1,) array) is correct only for the sub-array broadcast branch, not a single-element target; use valueshape.IsScalar.

3. Boolean-mask zero-select (Default.BooleanMask.cs, BooleanMaskSet). The trueCount==0 early return skipped value validation; arr[allFalseMask] = [93,1,39] into a (0,4) selection now raises the shape-mismatch ValueError. A scalar splat still no-ops.

Pre-existing, unrelated failures confirmed against a clean baseline (in CI-excluded categories): HashHelpersLong_GetPrime/_ExpandPrime, Slice2x2Mul_AssignmentChangesOriginal (np.arange int64 vs ToArray<int>), Broadcast_Sum_InternalError.

…ices

Indexing a 0-dimensional array (e.g. np.array(5)) with ANY axis-consuming index -- an integer/boolean ARRAY, including an empty one (s[np.array([],int)]), or a raw int[]/long[] fancy index -- is "too many indices" in NumPy: a scalar has no axes to consume. NumSharp's SINGLE-index dispatch bypasses the PrepareIndex gate (which is keyed on indicesLen != 1), so these fell straight into the fancy gather and returned a bogus shape instead of raising.

NumPy (mapping.c prepare_index):

  np.array(5)[np.array([],int)]      -> IndexError "too many indices for array: array is 0-dimensional, but 1 were indexed"
  np.array(5)[np.array([-1,-1])]     -> same
  np.array(5)[np.array([F])]         -> same (a bool array consumes its ndim axes)
  np.array(5)[np.array(True)]        -> OK (1,) (a 0-d bool consumes no axis)
  np.array(5)[np.newaxis] / [...]    -> OK (no axis consumed)

Fix: in the getter and setter single-index (indicesLen == 1) NDArray / int[] / long[] branches, when this.ndim == 0 raise the NumPy IndexError unless the index is a 0-d boolean (which adds a length-1/0 axis and is handled by the mask path below). The "N were indexed" count is the bool array's ndim (axes it expands to) else 1.

Random differential sweep (seed 20240626): the get-side "ns-accepted-invalid" bucket (all base-S, the 0-d scalar) drops 12 -> 0. Curated/dtype gate stays 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.
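The axis-counting rule behind the "too many indices" error can be sketched as follows. Everything here is illustrative: `Arr` is a hypothetical stand-in that records only an index array's dtype kind and shape, a Python `bool` stands in for a 0-d boolean, and `check_index_count` is not a NumSharp or NumPy function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arr:
    """Hypothetical stand-in for an index array: dtype kind and shape only."""
    dtype: str   # "int" or "bool"
    shape: tuple


def axes_consumed(item):
    """How many source axes one index item consumes."""
    if item is None or item is Ellipsis:
        return 0                      # newaxis / ellipsis consume nothing themselves
    if isinstance(item, bool):
        return 0                      # 0-d bool; checked before int (bool is an int)
    if isinstance(item, (int, slice)):
        return 1
    if isinstance(item, Arr):
        # A bool mask consumes mask.ndim axes; an integer array consumes one.
        return len(item.shape) if item.dtype == "bool" else 1
    raise IndexError("unsupported index item")


def check_index_count(ndim, index):
    """Raise the 'too many indices' IndexError, else return the axes used."""
    used = sum(axes_consumed(item) for item in index)
    if used > ndim:
        raise IndexError(
            f"too many indices for array: array is {ndim}-dimensional, "
            f"but {used} were indexed"
        )
    return used


print(check_index_count(0, [True]))   # np.array(5)[np.array(True)] is fine
print(check_index_count(0, [None]))   # np.array(5)[np.newaxis] is fine
try:
    check_index_count(0, [Arr("int", (0,))])   # np.array(5)[np.array([], int)]
except IndexError as exc:
    print("IndexError:", exc)
```

Because the count depends only on the kind and rank of each item, an empty integer array still consumes one axis, which is why the 0-d base rejects it even though nothing would be gathered.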

An advanced-index combo whose broadcast result is EMPTY (some advanced index has size 0) must yield an empty-shaped result; NumPy gathers nothing and never validates the un-accessed index values. NumSharp instead bounds-checked the sibling fancy values and rejected zero-length boolean masks, throwing on valid empty selections.

Random differential sweep (seed 20240626): the get-side "ns-threw-on-valid" bucket drops ~13 -> 0 and 5 setter cases that share these paths also clear (set region [7000,10000) 8 -> 3). Curated/dtype gate stays 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.

Two NumPy-verified (2.4.2) rules:

1. Skip fancy-ARRAY value bounds when the advanced block is empty (PrepareIndex FinishPrepare). NumPy bounds-checks integer array values at GATHER time, so when the block broadcasts to a 0 nothing is gathered and out-of-range values are never seen: A[arr([-3,1,2,3],(4,1)), np.array([],int)] -> (4,0) and A[arr([99]), np.array(False)] -> (0,4) are valid (NOT IndexError). A SCALAR int index is still validated eagerly (A[np.array([],int), 9] still raises on the 9), as is a bool array's axis size; broadcastability is still checked, so an un-broadcastable empty combo ((2,2) with (0,)) still raises shape-mismatch. The block is empty iff a Fancy / 0-d-bool op is itself size 0. (Used `is not null` for op.Array: the NDArray == / != operators are element-wise, not reference.)

2. A zero-length boolean mask axis matches an array axis of ANY size (IsPartialShapeMatch). NumPy: A[np.zeros(0,bool)] -> (0,4) on a size-3 axis, A[np.zeros((3,0),bool)] -> (0,); only a NON-zero mask axis must equal the array axis (A[np.zeros((0,2),bool)] still raises on the size-2 axis). This fixes the leading-empty-mask-plus-basic combos (ACS[barr([],(0,)), 0] -> (0,), B[barr([],(0,)), -1] -> (0,4)) that routed through this[mask] and were rejected. BooleanMask already returns (0,)+trailing for a zero-true mask.
Remaining (tracked): 3 non-consecutive 0-d-bool placement shapeDiffs (R2) and 3 setter throw-on-valid empty cases; plus the flaky teardown OOB (R3).
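Rule 2 above, the zero-length mask axis, reduces to a per-axis comparison. The sketch below works on shape tuples only and follows the behaviour as the commit message describes it; the function bodies are an assumption about how such a check could look, not NumSharp's IsPartialShapeMatch, and `mask_result_shape` is a hypothetical helper.

```python
def is_partial_shape_match(array_shape, mask_shape):
    """True if a boolean mask of `mask_shape` may index `array_shape`.

    A mask axis must equal the array axis it covers, except that a
    zero-length mask axis matches an array axis of any size.
    """
    if len(mask_shape) > len(array_shape):
        return False
    return all(m == 0 or m == a for m, a in zip(mask_shape, array_shape))


def mask_result_shape(array_shape, mask_shape, true_count):
    """Result shape of array[mask]: (number of True,) + uncovered trailing axes."""
    if not is_partial_shape_match(array_shape, mask_shape):
        raise IndexError("boolean index did not match indexed array")
    return (true_count,) + tuple(array_shape[len(mask_shape):])


A = (3, 4)                                # shape of the indexed array
print(mask_result_shape(A, (0,), 0))      # A[np.zeros(0, bool)]
print(mask_result_shape(A, (3, 0), 0))    # A[np.zeros((3,0), bool)]
try:
    mask_result_shape(A, (0, 2), 0)       # size-2 mask axis vs size-4 array axis
except IndexError as exc:
    print("IndexError:", exc)
```

A mask that contains a zero-length axis can never hold a True value, so `true_count` is necessarily 0 in the relaxed cases and the leading result axis is empty.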

…-op, not a crash

A pure-basic indexing assignment (slices / newaxis / ellipsis / scalar int) whose selection is EMPTY assigns nothing in NumPy, but the value must still broadcast to the empty target shape (a scalar/size-1 always does; an incompatible value raises ValueError). NumSharp routed the empty sliced view through NDIter.Copy, whose CreateCopyState indexes the first element of each operand and threw IndexOutOfRangeException on the 0-size view.

NumPy (2.4.2):

  a[None, :0:3, 2] = np.array([15])         -> no-op, selection (1,0)
  a[-1:-4, -2:2:2, ...] = np.array([])      -> no-op, selection (0,0)
  a[:, ::2, ::-1, None] = np.array([42])    -> no-op, selection (1,0,2,1)
  a[:, 1:1] = np.array([1,2,3])             -> ValueError (value can't broadcast to (3,0))

Fix: in UnmanagedStorage.SetData(NDArray, int[]), the broadcasted/sliced branch now fetches the target view once and, when it is size 0, validates the value broadcasts to the target shape (NumPy "could not broadcast input array from shape X into shape Y" on mismatch, the basic-indexing message form) then returns without invoking the copy iterator. A scalar/size-1 value skips the check (always broadcasts).

Random differential sweep (seed 20240626): set region [7000,10000) reaches 0 divergences (3 -> 0); the whole sweep is now down to 3 (the R2 0-d-bool placement shapeDiffs). Curated/dtype gate 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.
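The validate-then-no-op order described in the fix can be sketched on shapes. This is a simplified model, not UnmanagedStorage.SetData: `can_broadcast_to` and `assign_into_empty` are hypothetical names, and the check ignores NumPy's extra leniency for values with surplus leading size-1 axes.

```python
def can_broadcast_to(value_shape, target_shape):
    """True if `value_shape` broadcasts to `target_shape` (right-aligned)."""
    if len(value_shape) > len(target_shape):
        return False
    return all(v == t or v == 1
               for v, t in zip(reversed(value_shape), reversed(target_shape)))


def assign_into_empty(value_shape, target_shape):
    """Model of assigning into a size-0 selection: validate, then do nothing."""
    if not can_broadcast_to(value_shape, target_shape):
        fmt = lambda s: "(" + ",".join(map(str, s)) + ("," if len(s) == 1 else "") + ")"
        raise ValueError(
            f"could not broadcast input array from shape {fmt(value_shape)} "
            f"into shape {fmt(target_shape)}"
        )
    return "no-op"   # nothing is copied; the copy iterator is never invoked


print(assign_into_empty((1,), (1, 0)))        # a[None, :0:3, 2] = np.array([15])
print(assign_into_empty((0,), (0, 0)))        # a[-1:-4, -2:2:2, ...] = np.array([])
print(assign_into_empty((1,), (1, 0, 2, 1)))  # a[:, ::2, ::-1, None] = np.array([42])
try:
    assign_into_empty((3,), (3, 0))           # a[:, 1:1] = np.array([1,2,3])
except ValueError as exc:
    print("ValueError:", exc)
```

Doing the shape check before the early return is the point: an empty target still has a shape, and a value whose trailing axis is 3 cannot stretch to a trailing axis of 0.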

… hash collision (R2)

Two fixes; together they take the random differential sweep (seed 20240626) to 0 divergences across every measurable window (8700/10000; the [6700,7000) gap is the R3 teardown-OOB crash zone, tracked separately). Curated/dtype gate 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.

1. Non-consecutive 0-d-bool axis placement (NDArray.Indexing.Selection.Getter.cs). When 0-d bools / ints in an advanced block are SEPARATED by a slice or newaxis, NumPy moves the merged advanced axis to the FRONT (_get_transpose); the in-place handler TryBuild0dBoolWithBasic can't express that. It now BAILS for a non-consecutive advanced block (the items span a slice/newaxis), routing to the grid which has the consec/front rule. e.g.

  a[int, slice, True]           -> NumPy (1,0)     was (0,1)
  a[slice, True, slice, False]  -> NumPy (0,2,1)   was (2,0,1)
  a[int, None, slice, False]    -> NumPy (0,1,0)   was (1,0,0)

   The grid also had to carry the block dims when the block is PURELY 0-d bools (no fancy array): nothing fed bshape into broadcast_arrays, so an any-False block kept the all-ones default length 1 instead of 0. The grid now stretches its index arrays to the already-correct outShape.

2. Shape.Broadcast short-circuited on a hash COLLISION (View/Shape.Broadcasting.cs). The shape hash collides a 0-length axis with a size-1 axis ((1,1) and (0,1) both hash to 26599), and Broadcast returned the left shape unchanged whenever the hashes matched, so broadcast_to((1,1),(0,1)) wrongly yielded (1,1) instead of stretching the size-1 axis to the length-0 axis (NumPy: (0,1)). This surfaced in the pure-0-d-bool grid above and is a general broadcasting bug. The identical-shape fast path now confirms the dimensions are actually equal before short-circuiting; on a collision it falls through to the real broadcast (a tiny O(ndim) loop only when the hashes already match).
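Fix 2 is an instance of a general pattern: a hash match may gate a fast path, but only an equality check may confirm it. The sketch below uses a deliberately weak `toy_shape_hash` that is an invention for this example (it is NOT NumSharp's shape hash and does not produce 26599); it is built so that a 0-length axis collides with a size-1 axis, as in the bug described above.

```python
def toy_shape_hash(dims):
    """HYPOTHETICAL hash, constructed so that 0 and 1 collide per axis."""
    h = 17
    for d in dims:
        h = h * 31 + max(d, 1)
    return h


def real_broadcast(left, right):
    """Plain two-sided broadcast of two shapes (right-aligned)."""
    ndim = max(len(left), len(right))
    l = (1,) * (ndim - len(left)) + tuple(left)
    r = (1,) * (ndim - len(right)) + tuple(right)
    out = []
    for a, b in zip(l, r):
        if a == b or b == 1:
            out.append(a)
        elif a == 1:
            out.append(b)          # stretches 1 -> b, including b == 0
        else:
            raise ValueError("shapes cannot be broadcast together")
    return tuple(out)


def broadcast_buggy(left, right):
    # Fast path trusts the hash alone: wrong on a collision.
    if toy_shape_hash(left) == toy_shape_hash(right):
        return tuple(left)
    return real_broadcast(left, right)


def broadcast_fixed(left, right):
    # Fast path confirms real equality; a collision falls through.
    if toy_shape_hash(left) == toy_shape_hash(right) and tuple(left) == tuple(right):
        return tuple(left)
    return real_broadcast(left, right)


print(toy_shape_hash((1, 1)) == toy_shape_hash((0, 1)))  # the shapes collide
print(broadcast_buggy((1, 1), (0, 1)))                   # left shape returned unchanged
print(broadcast_fixed((1, 1), (0, 1)))                   # size-1 axis stretched to 0
```

The equality comparison costs O(ndim) and runs only when the hashes already match, so the common different-shape case pays nothing extra.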

…ty bool mask

Adds Indexing.CombinatorialParity.MatrixTests.cs, a [FuzzMatrix] CI gate that pins the five mapping.c-parity buckets fixed this session (R1 value-broadcast on empty/scalar selections, B2 0-d-base over-index, B3 empty advanced gather, B4 empty-slice assignment no-op, R2 non-consecutive 0-d-bool placement). Every shape/value/raise was probed against NumPy 2.4.2. These cases come from the seeded random sweep (index_random_20240626.jsonl), which stays [OpenBugs] only behind the flaky teardown OOB (handover R3), so the now-passing forms are gated here independently.

Writing the tests surfaced one more divergence (NOT in the random corpus, which only had 1-D empty masks): a MULTI-DIMENSIONAL empty boolean mask was routed as a single empty integer fancy index, giving the wrong rank. NumPy treats an empty bool array as a MASK that consumes mask.ndim axes via its nonzero:

  A[np.zeros((3,0), bool)]   -> NumPy (0,)   NumSharp was (3,0,4)
  A[np.zeros(0, bool)]       -> (0,4)        (the 1-D case matched only by coincidence)

Fix: in the getter and setter single-index dispatch, exclude boolean arrays from the size-0 empty-fancy routing so every empty bool mask falls to the mask path (which, since the IsPartialShapeMatch zero-axis fix, returns (0,)+trailing correctly).

Full suite 11003 pass / 0 fail on net8.0 AND net10.0 (10980 + 23 new); random sweep stays 0 divergences across all measurable windows.

…e open item

Updates the combinatorial-indexing handover to the executed reality: all five divergence buckets (R1 value-broadcast on empty/scalar selections, B2 0-d-base over-index, B3 empty advanced gather, B4 empty-slice assignment, R2 non-consecutive 0-d-bool placement + the Shape.Broadcast hash collision it uncovered) are fixed, committed (aea9fc78..7e968f5e), and pinned by the new Indexing.CombinatorialParity [FuzzMatrix] gate. The random sweep is 0 divergences across every measurable window.

The only open item is R3, a pre-existing flaky teardown heap-corruption. Records the diagnostics gathered this session that supersede the prior "pinned managed" guess: it is a specific corpus shape (not allocation volume), it is NOT a pooled native-buffer overrun (end-aligned guard pages run clean), and it resists deterministic repro (crash point varies 6285-9879; ~1/3 even in the tightest window). Next-session plan: a per-case red-zone on FromArray / the direct-VirtualAlloc zeroed path, or a dotnet-dump capture. Index_Random stays [OpenBugs] purely on the crash, not parity.

Mechanical, behavior-preserving rename of NumSharp's iterator/expression stack from the `Npy*` prefix to `ND*`. ND* matches NumSharp's house style (`NDArray` ↔ numpy.ndarray, the retired NDIterator) and NumPy's Python `nditer` name. The old `Npy*` prefix mirrored NumPy's C struct name (`NpyIter`); `ND*` is the user-facing convention used everywhere else.

Scope (NumSharp-owned only):
- Types/delegates/enums: NpyIter→NDIter, NpyIterRef→NDIterRef, NpyExpr→NDExpr, NpyIterState/Flags/PerOpFlags/GlobalFlags/OpFlags, NpyFlatIterator→NDFlatIterator, NpyAxisIter→NDAxisIter, NpyMemOverlap→NDMemOverlap, reduction-kernel structs + interfaces (INpy…→IND…), NpyArrayMethodFlags→NDArrayMethodFlags, and the NpyIter_* C-API-style method names → NDIter_*.
- Utilities: NpyComplexMath→NDComplexMath, NpyDivision→NDDivision, NpyIntegerPower→NDIntegerPower.
- Benchmark subsystem: benchmark/npyiter → benchmark/nditer (npyiter_{bench,sheet,cards,results,headline} → nditer_*, --skip-npyiter → --skip-nditer).
- 65 files renamed via git mv; ~190 files content-swept; website docs, docs/numpy notes, and frozen benchmark/history snapshots included.

Preserved (genuine NumPy references, NOT the stack):
- src/numpy/** (the upstream clone; NpyIter is NumPy's real C type).
- The .npy/.npz file format: `#region NpyFormat` (np.save/np.load) and the SaveAndLoadWithNpyFileExt test.
- NumPy's C function names quoted in docs (npyiter_allocate_arrays, npyiter_coalesce_axes, … kept verbatim).

Build: solution green (0 errors). Tests: 10980 passed, 0 failed, 11 skipped (net10.0, CI filter TestCategory!=OpenBugs&!=HighMemory).

Branch commit messages were rewritten Npy→ND separately (message-only history rewrite; file blobs in historical commits untouched). This commit is registered in .git-blame-ignore-revs as a mechanical rename.

…→ND rename

1. Expand all entries to full 40-char SHAs: git blame --ignore-revs-file ABORTS on abbreviated names ("fatal: invalid object name: ac02033").
2. Register the mechanical Npy→ND rename commit (301229b).

…e nditer branch

Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare.

Compiled via a two-pass audit:
- Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NDIter LOC).
- Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master), so they were reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NDIter Tier-3B production routing.

Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits.

The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).

Nucs force-pushed the nditer branch from f5c05a7 to 574a0d8 Compare April 23, 2026 09:34

Nucs mentioned this pull request Apr 28, 2026

Add NDIterator<T> overload with support for specific axis. #363

Open

Nucs mentioned this pull request May 17, 2026

[Core] Layout 'F/A/K' support #610

Open

8 tasks

Nucs added 17 commits June 13, 2026 17:26

Nucs added 6 commits June 13, 2026 23:13

Nucs added 28 commits June 27, 2026 10:12

Nucs force-pushed the nditer branch from 2d0fc6b to d08c296 Compare June 27, 2026 16:16

Nucs wants to merge 558 commits into master from nditer


Nucs commented Apr 22, 2026 • edited

TL;DR

1. NpyIter — full NumPy nditer port

Execution at NumPy speed

2. NpyExpr DSL + three-tier custom-op API

3. Legacy iterator stack retired

4. C/F/A/K memory-layout support

5. New & completed np.* APIs

6. Linear algebra

7. Performance (beyond NpyIter and linalg)

8. Official benchmark suite + honest methodology

9. Differential fuzzing vs NumPy (new infrastructure)

10. Correctness — NumPy-parity bug fixes

11. Memory management — ARC + IDisposable

12. Char8 primitive

13. Examples — trainable MNIST MLP

14. Kernel architecture & hygiene

15. Documentation

16. Tests & CI

Breaking changes


Nucs commented Jun 5, 2026

📊 Benchmark & performance — nditer

1. Fused strided-SIMD unary IL kernel (d01f1d63)

2. Official NumSharp-vs-NumPy benchmark (6038990f)

Reproducibility
